We need CNNs because neural networks do not scale well to image data. In Chapter 4, Computer Vision for Self-Driving Cars, we discussed how images are stored. When we build a simple image classifier, it takes color images with a size of 64 x 64 (height x width).
So, the input size for the neural network will be .
Therefore, our input layer will have 12,288 weights. If we use an image with a size of  , we will have 49,152 weights. If we add hidden layers, we will see that it will exponentially increase in training time. The CNN doesn't actually reduce the weights in the input layer. It finds a representation internally in hidden layers to basically take advantage of how images are formed. This way, we can actually make our neural network much more effective at dealing with image data.
In the next section, we will read about the intuition behind these neural networks.