Let's visualize our model to better understand what we just built. You will notice that the number of activation maps (reflected in the depth of each layer's output) progressively increases throughout the network. Conversely, the height and width of the activation maps tend to decrease, from (64 x 64) down to (16 x 16) by the time the dropout layer is reached. These two patterns are conventional in most, if not all, modern CNN architectures.
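The following is a minimal, self-contained sketch (not the exact model from the text) of a Keras network that reproduces the pattern described above: the filter depth grows while the spatial dimensions shrink from 64 x 64 down to 16 x 16 before the dropout layer. The layer sizes and names here are illustrative assumptions.

```python
from tensorflow.keras import layers, models

# Illustrative CNN: depth increases, spatial size decreases.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                  input_shape=(64, 64, 3)),                        # 64 x 64 x 32
    layers.MaxPooling2D((2, 2)),                                   # 32 x 32 x 32
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),  # 32 x 32 x 64
    layers.MaxPooling2D((2, 2)),                                   # 16 x 16 x 64
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])

# summary() prints each layer's output shape, making the depth/size
# pattern described above easy to verify.
model.summary()

# For a graphical rendering (requires pydot and graphviz):
# from tensorflow.keras.utils import plot_model
# plot_model(model, show_shapes=True, to_file='model.png')
```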
The variation in input and output dimensions between layers depends on how you have chosen to handle the border effects we discussed earlier, as well as on the stride you have set for the filters in your convolutional layer. Smaller strides lead to larger output dimensions, whereas larger strides lead to smaller ones. This comes down to the number of locations at which the filter is applied: for an input of width n, filter size f, padding p, and stride s, the output width is (n + 2p - f) / s + 1, rounded down.
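As a quick illustration of this point, the short sketch below (an assumed example, not code from the text) builds single-convolution models on the same 64 x 64 input and prints how the output shape changes with the padding mode and the stride:

```python
from tensorflow.keras import layers, models

# Compare how padding ('valid' vs. 'same') and stride affect the output shape
# of a single 3 x 3 convolution applied to a 64 x 64 x 3 input.
for padding in ('valid', 'same'):
    for strides in (1, 2):
        conv = models.Sequential([
            layers.Conv2D(32, (3, 3), strides=strides, padding=padding,
                          input_shape=(64, 64, 3)),
        ])
        print(padding, strides, conv.output_shape)

# Expected output shapes:
# valid, stride 1 -> (None, 62, 62, 32)  border effects trim the edges
# valid, stride 2 -> (None, 31, 31, 32)  a larger stride roughly halves the size
# same,  stride 1 -> (None, 64, 64, 32)  zero-padding preserves the size
# same,  stride 2 -> (None, 32, 32, 32)
```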