In their paper (Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv, 2014), Simonyan and Zisserman described how they made their network deeper than most previous ones. They actually introduced six different CNN architectures, from 11 to 19 layers deep. Each network is composed of five blocks of consecutive convolutions, each block followed by a max-pooling layer, and ends with three dense layers (with dropout applied during training). All the convolutional and max-pooling layers use SAME padding. The convolutions have a stride of s = 1 and use the ReLU activation function. All in all, a typical VGG network is represented in the following diagram:
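As a complement to the diagram, the block structure described above can also be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' code: the helper names (`VGG_BLOCKS`, `count_trainable_layers`, `spatial_sizes`) are hypothetical. The per-block convolution counts are those of the paper's configurations D and E (commonly called VGG-16 and VGG-19), and the 224 × 224 input size and the 2 × 2, stride-2 max-pooling are taken from the paper rather than from the text above.

```python
# Number of convolutions in each of the five blocks, for two of the
# paper's configurations (D and E, commonly called VGG-16 and VGG-19).
VGG_BLOCKS = {
    "VGG-16": [2, 2, 3, 3, 3],
    "VGG-19": [2, 2, 4, 4, 4],
}
NUM_DENSE_LAYERS = 3  # the three final dense layers


def count_trainable_layers(convs_per_block, num_dense=NUM_DENSE_LAYERS):
    """Depth = convolutional layers + dense layers.

    Max-pooling and dropout layers have no trainable weights,
    so they do not count toward the depth.
    """
    return sum(convs_per_block) + num_dense


def spatial_sizes(input_size, num_blocks, pool_stride=2):
    """Feature-map height/width after each block.

    SAME-padded, stride-1 convolutions leave the size unchanged, so only
    the max-pooling closing each block (assumed stride 2) reduces it.
    With SAME padding, the output size is ceil(input / stride).
    """
    sizes = []
    size = input_size
    for _ in range(num_blocks):
        size = -(-size // pool_stride)  # ceiling division
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    for name, blocks in VGG_BLOCKS.items():
        print(name, "trainable layers:", count_trainable_layers(blocks))
    # Assumes a 224 x 224 input image, as in the paper.
    print("sizes after each block:", spatial_sizes(224, num_blocks=5))
```

Counting only the layers with weights gives 13 + 3 = 16 and 16 + 3 = 19 for the two configurations, while the five pooling steps reduce an assumed 224 × 224 input to 7 × 7 before the dense layers.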
The two best-performing architectures, still commonly used today, are called VGG-16 and VGG-19. The numbers (16 and 19) represent the depth of these CNN architectures; that is, the number of trainable layers stacked together. For example, as shown in...