The authors began with a simple observation: a stack of two convolutions with 3 × 3 kernels has the same receptive field as a single convolution with 5 × 5 kernels (refer to Chapter 3, Modern Neural Networks, for the effective receptive field (ERF) formula).
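This observation can be checked numerically. The following is a minimal sketch, not code from the original authors: the helper name `receptive_field` and the `(kernel_size, stride)` layer description are assumptions made for illustration. It uses the standard recurrence for receptive fields of stacked convolutions, which may be written differently from the formula given in Chapter 3.

```python
def receptive_field(layers):
    """Side length (in input pixels) of the receptive field of a conv stack.

    `layers` is a list of (kernel_size, stride) pairs, ordered from the
    input to the output (hypothetical representation, for illustration).
    """
    rf = 1    # with no layers, a value only "sees" itself
    jump = 1  # spacing, in input pixels, between neighbouring values at this depth
    for kernel_size, stride in layers:
        # Each layer widens the field by (kernel_size - 1) steps of size `jump`.
        rf += (kernel_size - 1) * jump
        # Strides multiply the spacing seen by all subsequent layers.
        jump *= stride
    return rf


two_3x3 = receptive_field([(3, 1), (3, 1)])  # two stacked 3 x 3 convolutions
one_5x5 = receptive_field([(5, 1)])          # a single 5 x 5 convolution
print(two_3x3, one_5x5)  # 5 5
```

The same function can be applied to deeper stacks or to strided layers by extending the list of `(kernel_size, stride)` pairs.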
Similarly, three consecutive 3 × 3 convolutions result in a 7 × 7 receptive field, and five 3 × 3 convolutions result in an 11 × 11 receptive field. Therefore, while AlexNet uses large filters (up to 11 × 11), the VGG network stacks a greater number of smaller convolutions to obtain a larger ERF. The benefits of this change are twofold:
- It decreases the number of parameters: Indeed, the N filters of an 11 × 11 convolution layer imply 11 × 11 × D × N = 121DN values to train just for their kernels (for an input of depth D), while five 3 × 3 convolutions have a total of 1 ×...