Chapter 7: Captioning Images with CNNs and RNNs
Equipping neural networks with the ability to describe visual scenes in a human-readable fashion has to be one of the most interesting yet challenging applications of deep learning. The main difficulty arises from the fact that this problem combines two major subfields of artificial intelligence: Computer Vision (CV) and Natural Language Processing (NLP).
The architectures of most image captioning networks use a Convolutional Neural Network (CNN) to encode images in a numeric format so that they're suitable for the consumption of the decoder, which is typically a Recurrent Neural Network (RNN). This is a kind of network specialized in learning from sequential data, such as time series, video, and text.
As we'll see in this chapter, the challenges of building a system with these capabilities start with preparing the data, which we'll cover in the first recipe. Then, we'll implement an image captioning solution...