Implementing a reusable image caption feature extractor
The first step in creating a deep learning-based image captioning solution is to transform the data into a format the networks can consume. This means we must encode the images as tensors of pixel values, and the text as embeddings, which are numeric vector representations of words or sentences.
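To make these two encodings concrete, here is a minimal sketch. It is not the recipe's actual extractor; the dummy in-memory image, the 224x224 target size, and the tiny word-to-ID vocabulary are all illustrative assumptions. Turning an image into a normalized tensor and a caption into integer token IDs (the input that an embedding layer later maps to vectors) looks roughly like this:

```python
import numpy as np
from PIL import Image

# --- Image side: encode an image as a normalized float tensor ---
# A dummy 100x80 RGB image stands in for a real photo file on disk.
image = Image.new("RGB", (100, 80), color=(120, 40, 200))
image = image.resize((224, 224))  # 224x224 is a common input size (assumption)
image_tensor = np.asarray(image, dtype=np.float32) / 255.0  # scale pixels to [0, 1]
print(image_tensor.shape)  # → (224, 224, 3)

# --- Text side: map caption words to integer token IDs ---
# Real embeddings are learned by the network later; the preprocessing
# step only needs to tokenize and index the words.
caption = "a dog runs across the grass"
vocab = {word: idx + 1 for idx, word in enumerate(sorted(set(caption.split())))}
token_ids = [vocab[word] for word in caption.split()]
print(token_ids)  # → [1, 3, 5, 2, 6, 4]
```

In a real pipeline, the vocabulary would be built over every caption in the dataset, and the pixel normalization would match whatever the chosen feature-extraction network expects.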
In this recipe, we will implement a customizable, reusable component that preprocesses, ahead of time, the data we'll need to build an image captioner, saving us a great deal of time later in the process.
Let's begin!
Getting ready
The dependencies we need are tqdm (to display a nice progress bar) and Pillow (to load and manipulate images; it is also what TensorFlow's built-in image-loading functions rely on):
$> pip install Pillow tqdm
We will use the Flickr8k dataset, which is available on Kaggle: https://www.kaggle.com/adityajn105/flickr8k. Log in or sign up, download it, and decompress it in a directory of your choosing...
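Once decompressed, this version of Flickr8k pairs each image with several captions in a `captions.txt` CSV file with `image,caption` columns (an assumption based on the Kaggle release; verify against your download). A minimal sketch of grouping captions by image file, here using an in-memory sample instead of the real file:

```python
import csv
from collections import defaultdict
from io import StringIO

# In-memory stand-in for the dataset's captions.txt (assumed layout:
# a header row, then one "image,caption" pair per line).
sample = StringIO(
    "image,caption\n"
    "1000268201_693b08cb0e.jpg,A child in a pink dress is climbing up stairs .\n"
    "1000268201_693b08cb0e.jpg,A girl going into a wooden building .\n"
    "1001773457_577c3a7d70.jpg,A black dog and a spotted dog are fighting .\n"
)

# Group all captions belonging to the same image file.
captions_by_image = defaultdict(list)
for row in csv.DictReader(sample):
    captions_by_image[row["image"]].append(row["caption"])

print(len(captions_by_image))  # → 2 (distinct images in the sample)
print(captions_by_image["1001773457_577c3a7d70.jpg"][0])
```

With the real file, you would replace `StringIO(...)` with `open("captions.txt")`; the grouped dictionary then gives you, for each image, the list of reference captions to preprocess.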