While the classic VGG architecture ends with several fully connected (FC) layers (like AlexNet), the authors suggest an alternative version in which the dense layers are replaced by convolutional ones.
The first set of convolutions with larger kernels (7 × 7 and 3 × 3) reduces the spatial size of the feature maps to 1 × 1 (with no padding applied beforehand) and increases their depth to 4,096. Finally, a 1 × 1 convolution is used with as many filters as there are classes to predict (that is, N = 1,000 for ImageNet). The resulting 1 × 1 × N vector is normalized with the softmax function and then flattened into the final class predictions, with each value of the vector representing the predicted probability of the corresponding class.
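The shape bookkeeping of such a convolutional head can be traced with a minimal pure-Python sketch. It is not the authors' implementation: the sizes are toy assumptions (a 9 × 9 × 4 input, a hidden depth of 8 standing in for 4,096, and N = 5 classes), chosen only so that unpadded 7 × 7 and 3 × 3 convolutions bring the map down to 1 × 1. The weights are random, the ReLU activations are assumed, and the helper names (`conv2d_valid`, `random_layer`) are hypothetical.

```python
import math
import random


def conv2d_valid(x, kernels, biases):
    """Naive stride-1 convolution without padding.

    x: H x W x C_in nested lists; kernels: C_out kernels of shape k x k x C_in.
    Returns a (H-k+1) x (W-k+1) x C_out nested list.
    """
    h, w = len(x), len(x[0])
    k = len(kernels[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            pixel = []
            for kernel, bias in zip(kernels, biases):
                acc = bias
                for di in range(k):
                    for dj in range(k):
                        for c, weight in enumerate(kernel[di][dj]):
                            acc += x[i + di][j + dj][c] * weight
                pixel.append(acc)
            row.append(pixel)
        out.append(row)
    return out


def relu(x):
    """Element-wise ReLU over an H x W x C nested list."""
    return [[[max(0.0, v) for v in pixel] for pixel in row] for row in x]


def softmax(values):
    """Numerically stable softmax over a flat list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, c_out, k, c_in):
    """Hypothetical helper: random kernels and zero biases for one layer."""
    kernels = [[[[rng.uniform(-0.1, 0.1) for _ in range(c_in)]
                 for _ in range(k)] for _ in range(k)] for _ in range(c_out)]
    return kernels, [0.0] * c_out


rng = random.Random(0)
# Toy sizes (assumptions), far smaller than the real network.
IN_SIZE, IN_DEPTH, HIDDEN_DEPTH, NUM_CLASSES = 9, 4, 8, 5
features = [[[rng.random() for _ in range(IN_DEPTH)]
             for _ in range(IN_SIZE)] for _ in range(IN_SIZE)]

# Unpadded 7x7 convolution: 9x9 -> 3x3, depth 4 -> 8.
x = relu(conv2d_valid(features, *random_layer(rng, HIDDEN_DEPTH, 7, IN_DEPTH)))
# Unpadded 3x3 convolution: 3x3 -> 1x1, depth stays 8.
x = relu(conv2d_valid(x, *random_layer(rng, HIDDEN_DEPTH, 3, HIDDEN_DEPTH)))
# 1x1 convolution with one filter per class: 1x1x8 -> 1x1xN.
scores = conv2d_valid(x, *random_layer(rng, NUM_CLASSES, 1, HIDDEN_DEPTH))

# Softmax along the depth axis, then flatten the 1x1xN map into N values.
normalized = [[softmax(pixel) for pixel in row] for row in scores]
probs = normalized[0][0]

print(len(scores), len(scores[0]), len(probs))
print(probs)
```

Because no dense layer fixes the input size, the same weights could be applied to a larger input; the output would then be a grid of class predictions rather than a single 1 × 1 × N vector.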
1 × 1 convolutions are commonly used to change the depth of the input volume without affecting its spatial structure. For each spatial position, the new values are interpolated...