The machine learning pipeline for image caption generation
Here we will look at the image caption generation pipeline at a very high level and then discuss it piece by piece until we have the full model. The framework consists of three main components and one optional component (a code sketch of how they fit together follows the list):
- A CNN that generates encoded vectors for images
- An embedding layer that learns word vectors
- (Optional) An adaptation layer that transforms embeddings of a given dimensionality to an arbitrary target dimensionality (details will be discussed later)
- An LSTM that takes the encoded image vectors and outputs the corresponding caption
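The following is a minimal sketch of how these four components connect, written with the Keras API. Every dimension here (a 1,000-dimensional encoded image vector, a 5,000-word vocabulary, 128-dimensional word embeddings, and a 128-unit LSTM) is an illustrative assumption, not a value prescribed by the pipeline:

```python
import tensorflow as tf
from tensorflow.keras import layers

image_feature_dim = 1000   # assumed output size of the CNN encoder
vocab_size = 5000          # assumed caption vocabulary size
embedding_dim = 128        # assumed word vector dimensionality
lstm_units = 128           # assumed LSTM hidden state size

# 1. Encoded image vector produced by the CNN (computed separately).
image_input = layers.Input(shape=(image_feature_dim,))

# 2. Embedding layer learning word vectors for the caption tokens.
word_input = layers.Input(shape=(None,), dtype='int32')  # token IDs
word_vectors = layers.Embedding(vocab_size, embedding_dim)(word_input)

# 3. (Optional) adaptation layer: project the image vector to the
#    embedding dimensionality so both inputs share one input space.
adapted_image = layers.Dense(embedding_dim)(image_input)

# 4. LSTM consuming the adapted image vector as the first time step,
#    followed by the word vectors, predicting one word per step.
image_as_step = layers.Reshape((1, embedding_dim))(adapted_image)
sequence = layers.Concatenate(axis=1)([image_as_step, word_vectors])
lstm_out = layers.LSTM(lstm_units, return_sequences=True)(sequence)
predictions = layers.Dense(vocab_size, activation='softmax')(lstm_out)

model = tf.keras.Model(inputs=[image_input, word_input],
                       outputs=predictions)
```

The adaptation layer is realized here as a simple Dense projection; its role, letting the image vector and the word vectors share one dimensionality so the LSTM can consume both, is discussed in detail later.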
Let's start with the CNN that generates the encoded vectors for images. We can achieve this by training a CNN on a large classification dataset, such as ImageNet, and then using that learned knowledge to produce compressed vectorized representations of images.
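As a concrete illustration of this step, the sketch below extracts an encoded vector with a pretrained VGG16. This is one choice among many (any ImageNet-trained CNN would work similarly), and the image filename is a placeholder:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# Load VGG16 trained on ImageNet, dropping the classification head
# (include_top=False) and average-pooling the final feature maps
# down to a single vector per image.
encoder = VGG16(weights='imagenet', include_top=False, pooling='avg')

# Load and preprocess one image to the input size VGG16 expects.
image = tf.keras.preprocessing.image.load_img('example.jpg',  # placeholder path
                                              target_size=(224, 224))
pixels = tf.keras.preprocessing.image.img_to_array(image)
pixels = preprocess_input(pixels[np.newaxis, ...])  # add a batch axis

# A compressed, encoded vector representing the image.
encoded = encoder.predict(pixels)
print(encoded.shape)  # (1, 512) for VGG16 with average pooling
```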
One might ask, why not feed the image to the LSTM as it is? Let's go back to a simple calculation we did in the previous chapter...
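As a rough back-of-the-envelope illustration of why this matters (the image size and LSTM width below are assumptions chosen for illustration, not the numbers from the previous chapter), consider the LSTM's weight matrices, which grow linearly with the input dimensionality:

```python
# An LSTM has 4 gates, each with a weight matrix of shape
# (hidden, input + hidden) plus a bias vector, giving roughly
# 4 * hidden * (input + hidden + 1) parameters.
def lstm_params(input_dim, hidden_dim):
    return 4 * hidden_dim * (input_dim + hidden_dim + 1)

hidden = 512  # assumed LSTM size, for illustration only

raw_image = 224 * 224 * 3  # a flattened 224x224 RGB image
print(lstm_params(raw_image, hidden))      # ~309 million parameters

encoded_image = 1000       # a compressed CNN representation
print(lstm_params(encoded_image, hidden))  # ~3.1 million parameters
```

Feeding raw pixels directly would inflate the LSTM's weight matrices by roughly two orders of magnitude; the compressed representation from the CNN avoids exactly this blow-up.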