We saw that each layer has a depth that denotes its number of activation maps. These are also referred to as channels, where each channel contains one activation map with a height and width of n x n. Our first layer, for example, has 16 different maps of size 64 x 64. Similarly, the fourth layer has 16 activation maps of size 32 x 32, and the eighth layer has 32 activation maps, each of size 16 x 16. Each of these activation maps is generated by a specific filter from its respective layer and is passed forward to subsequent layers to encode higher-level features. This agrees with the architecture of our smile detector model, which we can always verify, as shown here:
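
As a rough cross-check of the shapes quoted above, the following sketch traces how the height, width, and channel count change through a small stack of layers. It is not the smile detector's actual definition: the 64 x 64 x 3 input, the `same` padding, the 2 x 2 pooling, and the helper names `conv2d_shape`, `max_pool_shape`, and `trace_shapes` are all assumptions made for illustration, and layers that leave the shape unchanged (such as activations) are omitted.

```python
import math


def conv2d_shape(shape, filters, kernel=3, stride=1, padding="same"):
    """Return (height, width, channels) after a 2D convolution."""
    h, w, _ = shape
    if padding == "same":
        # Zero padding keeps the spatial size at ceil(n / stride).
        out_h, out_w = math.ceil(h / stride), math.ceil(w / stride)
    else:
        # "valid": no padding, so the kernel must fit inside the input.
        out_h = (h - kernel) // stride + 1
        out_w = (w - kernel) // stride + 1
    # One activation map (channel) is produced per filter.
    return (out_h, out_w, filters)


def max_pool_shape(shape, pool=2):
    """Return the shape after non-overlapping pool x pool max pooling."""
    h, w, channels = shape
    # Pooling shrinks height and width but leaves the channel count alone.
    return (h // pool, w // pool, channels)


def trace_shapes(input_shape, layers):
    """Apply each (kind, filters) layer in turn and collect output shapes."""
    shapes = []
    shape = input_shape
    for kind, filters in layers:
        if kind == "conv":
            shape = conv2d_shape(shape, filters)
        elif kind == "pool":
            shape = max_pool_shape(shape)
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        shapes.append(shape)
    return shapes


# Hypothetical stack: an assumed 64 x 64 RGB input, two conv/pool stages.
layers = [("conv", 16), ("pool", None), ("conv", 32), ("pool", None)]
shapes = trace_shapes((64, 64, 3), layers)

for (kind, _), shape in zip(layers, shapes):
    print(f"{kind:<4} -> {shape}")
```

Under these assumptions, the first convolution yields 16 maps of 64 x 64, the first pooling step halves them to 32 x 32, and the second stage ends with 32 maps of 16 x 16, which matches the map counts and sizes described above.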
