Getting to know the data
Let's first understand the data we are working with, both directly and indirectly. There are two datasets we will rely on:
The ILSVRC ImageNet dataset (http://image-net.org/download)
The MS-COCO dataset (http://cocodataset.org/#download)
We will not use the first dataset directly, but it is essential for caption learning. This dataset contains images and their respective class labels (for example, cat, dog, and car). We will use a CNN that is already trained on this dataset, so we do not need to download it or train a model on it from scratch. Next, we will use the MS-COCO dataset, which contains images and their respective captions. We will learn directly from this dataset by mapping each image to a fixed-size feature vector using the CNN, and then mapping this vector to the corresponding caption using an LSTM (we will discuss the process in detail later).
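The first half of that pipeline can be sketched as follows. This is a minimal NumPy sketch, not the actual implementation: the `extract_features` function below stands in for a pretrained CNN (in the real pipeline, a network pretrained on ImageNet with its classification head removed), and the feature dimension of 512 is an illustrative choice. The point is only the interface: an image of arbitrary spatial size goes in, a fixed-size feature vector comes out, ready to be fed to the LSTM decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative feature size; real pretrained CNNs expose their own
# (e.g. a few hundred to a few thousand dimensions).
FEATURE_DIM = 512

def extract_features(image: np.ndarray) -> np.ndarray:
    """Map an (H, W, C) image to a fixed-size feature vector.

    Placeholder for a pretrained CNN: global average pooling over the
    spatial dimensions, followed by an untrained random linear
    projection. A real pipeline would run the image through the
    ImageNet-pretrained network instead.
    """
    pooled = image.mean(axis=(0, 1))                        # shape (C,)
    projection = rng.standard_normal((pooled.shape[0], FEATURE_DIM))
    return pooled @ projection                              # shape (FEATURE_DIM,)

# An MS-COCO-style RGB image (random pixels for the sketch).
image = rng.random((224, 224, 3))
features = extract_features(image)
print(features.shape)  # (512,) -- fixed size, regardless of input resolution
```

Note that the output size does not depend on the input resolution; that fixed-size vector is exactly what lets a single LSTM decoder consume features from images of varying shapes.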
ILSVRC ImageNet dataset
ImageNet is an image dataset that contains a large set of images (~1 million) and their respective...