Implementing an image captioning network
An image captioning architecture consists of an encoder and a decoder. The encoder is a CNN (typically a pre-trained one) that converts input images into numeric feature vectors. These vectors are then passed, along with text sequences, to the decoder, an RNN that learns to iteratively generate each word of the corresponding caption.
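To make this encoder-decoder idea concrete, here is a minimal sketch of one common way to wire it up in TensorFlow/Keras, where the image feature vector and the partial caption are processed in separate branches and then merged to predict the next word. The feature size, vocabulary size, and caption length below are placeholder assumptions, not values taken from this recipe:

import tensorflow as tf
from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                     LSTM, add)
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000      # hypothetical vocabulary size
MAX_CAPTION_LEN = 40   # hypothetical maximum caption length
FEATURE_SIZE = 2048    # e.g., the size of the pre-extracted CNN features

# Encoder branch: the image feature vector produced by the CNN.
image_input = Input(shape=(FEATURE_SIZE,))
image_dense = Dense(256, activation='relu')(Dropout(0.5)(image_input))

# Decoder branch: the partial caption generated so far.
caption_input = Input(shape=(MAX_CAPTION_LEN,))
caption_embed = Embedding(VOCAB_SIZE, 256, mask_zero=True)(caption_input)
caption_lstm = LSTM(256)(Dropout(0.5)(caption_embed))

# Merge both branches and predict the next word in the caption.
merged = add([image_dense, caption_lstm])
hidden = Dense(256, activation='relu')(merged)
output = Dense(VOCAB_SIZE, activation='softmax')(hidden)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')

The exact layer sizes and the way the branches are combined vary from implementation to implementation; the version we build in this recipe may differ in its details.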
In this recipe, we'll implement an image captioner trained on the Flickr8k dataset. We'll leverage the feature extractor we implemented in the Implementing a reusable image caption feature extractor recipe.
Let's begin, shall we?
Getting ready
The external dependencies we'll be using in this recipe are Pillow, nltk, and tqdm. You can install them all at once with the following command:
$> pip install Pillow nltk tqdm
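Depending on how nltk is used later in the recipe, you may also need its tokenizer data, which is downloaded separately from the package itself. A hedged example of how to fetch it (the punkt tokenizer here is an assumption, not a requirement stated in this recipe):

import nltk

# Download the punkt tokenizer models used by nltk.word_tokenize().
nltk.download('punkt')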
We will use the Flickr8k dataset, which you can get from Kaggle: https://www.kaggle.com/adityajn105...