Getting to know the data
Let’s first understand the data we will be working with, both directly and indirectly. We will rely on two datasets:
- The ILSVRC ImageNet dataset (http://image-net.org/download)
- The MS-COCO dataset (http://cocodataset.org/#download)
We will not use the first dataset directly, but it is essential for caption learning: it contains images and their respective class labels (for example, cat, dog, and car). Rather than downloading this dataset and training a model on it from scratch, we will use a Vision Transformer that has already been trained on it. Next, we will use the MS-COCO dataset, which contains images and their respective captions. We will learn directly from this dataset by mapping each image to a fixed-size feature vector with the Vision Transformer and then mapping that vector to the corresponding caption with a text-based Transformer (we will discuss this process in detail later).
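To make the first half of that pipeline concrete, here is a minimal sketch of extracting a fixed-size feature vector from an image with a pretrained Vision Transformer. It assumes the Hugging Face transformers and torch packages, a local file named example.jpg, and the google/vit-base-patch16-224-in21k checkpoint; these names, and the choice of the [CLS] embedding as the feature vector, are illustrative assumptions rather than the exact setup we build later.

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Load a ViT pretrained on ImageNet (checkpoint name is an assumption)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

# "example.jpg" is a placeholder for any image you want to encode
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Forward pass without gradients; we only want features, not training
with torch.no_grad():
    outputs = model(**inputs)

# One common choice of fixed-size vector: the [CLS] token embedding.
# For this checkpoint it is a 768-dimensional vector per image; a
# text-based Transformer decoder would then consume such features
# to generate the caption (covered later in the chapter).
features = outputs.last_hidden_state[:, 0, :]
print(features.shape)  # torch.Size([1, 768])
```

The key point is that the image encoder is frozen, pretrained knowledge: we never touch ImageNet itself, only a model already trained on it, and all of our actual training happens on the MS-COCO image-caption pairs.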
ILSVRC ImageNet dataset
ImageNet...