Towards a deep learning approach
While playing with handwritten digit recognition, we came to the conclusion that the closer we get to the accuracy of 99%, the more difficult it is to improve. If we want more improvement, we definitely need a new idea. What are we missing? Think about it.
The fundamental intuition is that in our examples so far, we are not making use of the local spatial structure of images. In particular, this piece of code transforms the bitmap representing each written digit into a flat vector where the local spatial structure (the fact that some pixels are closer to each other) is gone:
# X_train is 60000 rows of 28x28 values; we --> reshape it as in # 60000 x 784.
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
However, this is not how our brain works. Remember that our vision is based on multiple cortex levels, each one recognizing more and more structured information, still preserving the locality. First, we see single pixels, then from those, we recognize simple geometric forms, and then more and more sophisticated elements such as objects, faces, human bodies, animals, and so on.
In Chapter 4, Convolutional Neural Networks we will see that a particular type of deep learning network, known as a Convolutional Neural Network (in short, CNN) has been developed by taking into account both the idea of preserving the local spatial structure in images (and more generally, in any type of information that has a spatial structure) and the idea of learning via progressive levels of abstraction: with one layer you can only learn simple patterns, with more than one layer you can learn multiple patterns. Before discussing CNNs, we need to discuss some aspects of TensorFlow architecture and have a practical introduction to a few additional machine learning concepts. This will be the topic of the upcoming chapters.