Intuition and justification for CNNs
The information we extract from sensory inputs is often determined by their context. With images, we can assume that nearby pixels are closely related and that their collective information is more relevant when taken as a unit. Conversely, we can assume that pixels far apart from each other convey little information about one another. For example, to recognize letters or digits, we need to analyze how nearby pixels depend on each other, because together they determine the shape of the element. In this way, we can figure out the difference between, say, a 0 and a 1. The pixels of an image are organized in a two-dimensional grid, and if the image isn't grayscale, we'll have a third dimension for the color channels.
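As a quick illustration (a minimal NumPy sketch, not code from the original example; the 28x28 size is just an assumption for a digit-sized image), an image naturally lives in a multi-dimensional array, with an extra dimension for the color channels:

```python
import numpy as np

# A grayscale image is a 2D grid of pixel intensities: height x width.
grayscale = np.zeros((28, 28))

# A color image adds a third dimension for the channels: height x width x RGB.
color = np.zeros((28, 28, 3))

print(grayscale.shape)  # (28, 28)
print(color.shape)      # (28, 28, 3)
```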
Similarly, a magnetic resonance imaging (MRI) scan also occupies three-dimensional space. You might recall that, until now, if we wanted to feed an image to an NN, we had to reshape it from a two-dimensional array into a one-dimensional array. CNNs...
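To make the reshaping step concrete, here is a hedged NumPy sketch (the 28x28 shape is illustrative, not taken from the text) showing how flattening a 2D image into the 1D vector a fully connected NN expects discards the grid structure:

```python
import numpy as np

image = np.arange(28 * 28).reshape(28, 28)  # a 28x28 grayscale "image"
flat = image.reshape(-1)                    # 1D vector of 784 values

# Pixels (0, 0) and (1, 0) are vertical neighbors in the grid, but they end
# up 28 positions apart in the flattened vector, so the network has no
# built-in notion of their spatial closeness.
print(flat.shape)                 # (784,)
print(image[0, 0], image[1, 0])   # grid neighbors
print(flat[0], flat[28])          # same values, 28 positions apart
```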