Generally, a convolutional layer applies several different kernels, each of which generates its own feature map. Thus, the convolution operation produces a large volume of output data.
As an example, applying a kernel of shape 3 x 3 x 1 (with stride 1 and no padding) to an MNIST image of shape 28 x 28 x 1 pixels produces a feature map of shape 26 x 26 x 1, since 28 - 3 + 1 = 26. If we apply 32 such filters in a convolutional layer, then the output will be of shape 32 x 26 x 26 x 1, that is, 32 feature maps of shape 26 x 26 x 1.
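The shape arithmetic in this example can be checked with a minimal NumPy sketch. It assumes stride 1 and no padding, uses a random array as a stand-in for one MNIST image, and uses random values for the 32 filters; the helper name `conv2d_valid` is hypothetical and introduced only for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1   # 28 - 3 + 1 = 26
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the patch and the kernel, then sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))                # stand-in for one MNIST image
kernels = rng.standard_normal((32, 3, 3))   # 32 hypothetical 3 x 3 filters

# One feature map per kernel, stacked along a new leading axis.
feature_maps = np.stack([conv2d_valid(image, k) for k in kernels])
feature_maps = feature_maps[..., np.newaxis]  # trailing channel axis of size 1

print(feature_maps.shape)  # (32, 26, 26, 1)
```

The explicit loops are written for readability rather than speed; a real layer would use a vectorized library routine.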
The convolution output holds 32 x 26 x 26 = 21,632 values, roughly 27 times as many as the 784 values of the original 28 x 28 x 1 image. Thus, to simplify the learning for the next layer, we apply the concept of pooling.
Pooling refers to calculating an aggregate statistic over regions of the convolved feature space. The two most popular aggregate statistics are the maximum and the average. The output...