The machine learning pipeline for image caption generation
Here we will look at the image caption generation pipeline at a very high level and then discuss it piece by piece until we have the full model. The image caption generation framework consists of two main components:
- A pretrained Vision Transformer model to produce an image representation
- A text-based decoder model that decodes the image representation into a sequence of token IDs. It uses a text tokenizer to convert tokens to token IDs and vice versa
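To make the two-component flow concrete, here is a minimal sketch of the pipeline's shape. The encoder, decoder, and vocabulary below are hypothetical stand-ins (a toy vocabulary and random logits), not a real pretrained ViT or trained decoder; only the overall structure — encode the image once, then decode token IDs step by step — mirrors the framework described above.

```python
import numpy as np

# Toy vocabulary; a real system would use a trained text tokenizer.
VOCAB = ["<start>", "<end>", "a", "cat", "on", "mat"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}

def vit_encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for a pretrained ViT: map an (H, W, C) image to one vector."""
    return image.mean(axis=(0, 1))  # (H, W, C) -> (C,)

def decode_step(image_vec: np.ndarray, prev_ids: list) -> np.ndarray:
    """Stand-in for one decoder step: return logits over the vocabulary.

    A real decoder would attend to the image representation and the
    previously generated tokens; here we emit deterministic toy logits.
    """
    rng = np.random.default_rng(len(prev_ids))
    return rng.normal(size=len(VOCAB))

def generate_caption(image: np.ndarray, max_len: int = 10) -> list:
    """Greedy decoding: encode the image once, then pick the argmax token
    at each step until <end> is produced or max_len is reached."""
    image_vec = vit_encode(image)
    ids = [TOKEN_TO_ID["<start>"]]
    for _ in range(max_len):
        logits = decode_step(image_vec, ids)
        next_id = int(np.argmax(logits))
        ids.append(next_id)
        if next_id == TOKEN_TO_ID["<end>"]:
            break
    return [ID_TO_TOKEN[i] for i in ids]

caption_tokens = generate_caption(np.zeros((224, 224, 3)))
```

The key design point the sketch illustrates is the separation of concerns: the encoder runs once per image, while the decoder runs once per generated token, conditioning on everything generated so far.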
Though Transformer models were initially used for text-based NLP problems, they have outgrown the domain of text data and are now applied in other areas such as image and audio data.
Here we will be using one Transformer model that can process image data and another that can process text data.
Vision Transformer (ViT)
First, let’s look at the Transformer generating the encoded vector representations of images. We will be...