Convolutions and max layers
The invention of convolutional layers brought a great improvement in image classification, as demonstrated on the MNIST database.
Whereas the fully-connected layers seen previously compute each output from all input values (all pixels, in the case of an image), a 2D convolution layer considers only a small patch of N x N pixels of the 2D input image for each output unit. This patch is also called a window or receptive field. The dimensions of the patch are the kernel dimensions, N is the kernel size, and the coefficients (parameters) applied to the patch are the kernel.
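To make the contrast concrete, here is a minimal NumPy sketch of a single output unit of each kind. The random input, the 28 x 28 size (the size of MNIST images), the kernel size N = 3, and the patch position are illustrative assumptions, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))  # stand-in for one MNIST-sized image (assumed data)

# Fully-connected unit: one weight per input pixel, so it sees the whole image.
fc_weights = rng.random(image.size)
fc_output = float(fc_weights @ image.ravel())

# Convolutional unit: one weight per pixel of a single N x N patch.
N = 3                          # kernel size (assumed)
kernel = rng.random((N, N))    # the kernel: N * N coefficients
row, col = 5, 7                # hypothetical top-left corner of the patch
patch = image[row:row + N, col:col + N]
conv_output = float(np.sum(patch * kernel))

print(fc_weights.size, kernel.size)  # 784 9
```

Under these assumptions, the fully-connected unit needs 784 weights while the convolutional unit needs only 9, and those 9 are reused at every position of the image.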
At each position of the input image, the kernel produces a scalar, and the values at all positions together form a matrix (2D tensor) called a feature map. Convolving the kernel over the input image as a sliding window therefore creates a new output image. The stride of the kernel defines the number of pixels by which the window is shifted across the image: with a stride of 2, the convolution with the kernel is computed every 2 pixels.
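The sliding window and the effect of the stride can be sketched in a few lines of NumPy. This is a simplified illustration, assuming a square kernel and no padding. The function name conv2d and the 6 x 6 input are hypothetical. As is common in deep learning code, the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a 2D image (no padding) and return the feature map."""
    n = kernel.shape[0]
    out_h = (image.shape[0] - n) // stride + 1
    out_w = (image.shape[1] - n) // stride + 1
    feature_map = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride        # top-left corner of the window
            patch = image[r:r + n, c:c + n]
            feature_map[i, j] = np.sum(patch * kernel)  # one scalar per position
    return feature_map

image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6 x 6 input (assumed)
kernel = np.ones((3, 3))                          # toy 3 x 3 kernel (assumed)

fm1 = conv2d(image, kernel, stride=1)
fm2 = conv2d(image, kernel, stride=2)
print(fm1.shape, fm2.shape)  # (4, 4) (2, 2)
```

With a stride of 2, the window is evaluated at every second position only, so the feature map keeps every other row and column of the stride-1 result.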
For example, on a 224 x 224 input...