Based on another intuition, the VGG authors doubled the depth of the feature maps after each block of convolutions (from 64 after the first block up to 512 in the deepest ones). Since each block is followed by a max-pooling layer with a 2 × 2 window size and a stride of 2, the spatial dimensions are halved each time the depth doubles.
This allows the network to encode spatial information into increasingly complex and discriminative features for classification.
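The resulting shape progression can be traced with a short sketch. The helper below is hypothetical (not from the VGG paper or any library); it assumes a 224 × 224 input and the VGG-16 block depths, halving the spatial size after each block's pooling layer:

```python
def vgg_block_shapes(input_size=224, depths=(64, 128, 256, 512, 512)):
    """Return the (height, width, channels) shape after each
    conv block + 2x2/stride-2 max-pooling in a VGG-style network."""
    shapes = []
    size = input_size
    for depth in depths:
        # the block's convolutions set the channel depth;
        # the trailing max-pooling halves height and width
        size //= 2
        shapes.append((size, size, depth))
    return shapes

print(vgg_block_shapes())
# [(112, 112, 64), (56, 56, 128), (28, 28, 256), (14, 14, 512), (7, 7, 512)]
```

Note that the depth doubles only up to 512; the last two blocks keep that depth while the spatial dimensions continue to shrink down to 7 × 7.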