Image captioning means generating a caption given an image. In this section, we will first learn about the preprocessing to be done to build an LSTM that can generate a text caption given an image, and then will learn how to combine a CNN and LSTM to perform image captioning. Before we learn about building a system that generates captions, let's understand how a sample input and output might look:
In the preceding example, the image is the input and the expected output is the caption of the image – In this image I can see few candles. The background is in black color.
The strategy that we will adopt to solve this problem is as follows:
- Preprocess the output (ground truth annotations/captions) so that each unique word is represented by a unique ID.
- Given that the output sentences can be of any length, let's assign a start and end token so that the model knows when to stop generating predictions. Furthermore, ensure that all input sentences...