The purpose of the convolution operation on images is to extract features while preserving spatial information. The size and stride of the filter determine how finely that spatial information is preserved. For example, a 3x3 filter with stride 1 retains finer detail than a 3x3 filter with stride 2. Similarly, a 3x3 filter captures more tightly localized, more strongly spatially correlated information than a 5x5 filter, because it looks at a smaller neighborhood.
The purpose of strided convolutions or max pooling after convolutions is to downsize the feature maps. This reduces the size of the input to the following layers, and therefore the computation, while preserving as much information as possible.
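To make the effect of stride concrete, here is a minimal sketch of the standard output-size formula for a convolution or pooling layer, floor((n + 2p - k) / s) + 1. The function name `conv_output_size` and the choice of a 40-pixel input are only for illustration.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size along one dimension for a conv or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# 3x3 filter, stride 1, padding 1: spatial size is unchanged
print(conv_output_size(40, kernel=3, stride=1, padding=1))

# 3x3 filter, stride 2, padding 1: spatial size is halved
print(conv_output_size(40, kernel=3, stride=2, padding=1))

# 2x2 max pooling with stride 2: spatial size is also halved
print(conv_output_size(40, kernel=2, stride=2, padding=0))
```

With stride 2 the filter is evaluated at every other position, which is why the output keeps less fine detail than with stride 1.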
Now, how to choose the number of filters. As you have observed, each time the image size halves, the number of filters doubles. This is a general rule of thumb, not a hard rule. You can experiment by changing the number of filters.
But let’s say I have a 40x40x128 feature volume and I must downsize it to 20x20. What is the best way to preserve the information? Halving the width and height leaves only a quarter of the spatial positions, so to compensate we increase the number of filters to 256. Even so, the resulting 20x20x256 volume holds only half as many values as the 40x40x128 one, so the next step still works on half as much data as the previous step. If you display the feature maps as heatmaps, you can observe the features corresponding to an object in the original image.
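The arithmetic behind this example can be checked with a short sketch. The helper names `volume` and `conv_macs` are hypothetical, and the 3x3 kernel used for the multiply-accumulate count is an assumption, since the example above does not fix a kernel size.

```python
def volume(height, width, channels):
    """Number of values in a feature volume."""
    return height * width * channels


def conv_macs(out_h, out_w, in_ch, out_ch, kernel=3):
    """Multiply-accumulates for one conv layer (kernel size is an assumption)."""
    return out_h * out_w * in_ch * out_ch * kernel * kernel


before = volume(40, 40, 128)         # volume before downsizing
same_filters = volume(20, 20, 128)   # downsized, filters kept at 128
doubled = volume(20, 20, 256)        # downsized, filters doubled to 256

print(before, same_filters, doubled)
print(doubled / before)              # fraction of values kept when doubling filters
print(same_filters / before)         # fraction kept without doubling

# Cost of a 40x40 layer (128 -> 128) versus the downsizing layer (128 -> 256)
print(conv_macs(40, 40, 128, 128), conv_macs(20, 20, 128, 256))
```

Without doubling the filters the volume would shrink to a quarter of its size. Doubling them keeps half, which is the trade-off between saving computation and preserving information described above.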