The first step is to leverage a pretrained DCNN, applying the principles of transfer learning to extract the right features from our source images. To keep things simple, we will not fine-tune the VGG-16 model or connect it to the rest of our model architecture. Instead, we will extract the bottleneck features from all our images beforehand to speed up training later, since building a sequence model with several LSTMs takes a lot of training time even on GPUs, as we will see shortly.
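To make the idea of bottleneck feature extraction concrete, the following is a minimal sketch using Keras. The helper name extract_bottleneck_features, the 224 x 224 input size, and the use of average pooling over the final convolutional block are illustrative assumptions, not necessarily the exact setup used later in the chapter:

import numpy as np
from keras.applications import vgg16
from keras.preprocessing import image

# load VGG-16 pretrained on ImageNet, dropping the classifier layers;
# include_top=False exposes the convolutional "bottleneck" outputs
vgg_model = vgg16.VGG16(include_top=False, weights='imagenet',
                        input_shape=(224, 224, 3), pooling='avg')
vgg_model.trainable = False  # frozen: we do not fine-tune the DCNN

def extract_bottleneck_features(img_path):
    # load and preprocess a single image the way VGG-16 expects
    img = image.load_img(img_path, target_size=(224, 224))
    arr = image.img_to_array(img)
    arr = np.expand_dims(arr, axis=0)
    arr = vgg16.preprocess_input(arr)
    # one forward pass through the frozen DCNN -> fixed-length feature vector
    return vgg_model.predict(arr)[0]

With include_top=False and pooling='avg', each image is reduced to a fixed-length vector, which is what makes precomputing the features cheap to store and fast to feed into the sequence model.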
To get started, we will load all the source image filenames and their corresponding captions from the Flickr8k_text folder in the source dataset. We will also combine the dev and train dataset images, as mentioned before:
import pandas as pd
import numpy as np

# read train image file names
with open('../Flickr8k_text...
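For reference, here is one way this loading step might look in full. The exact file names (Flickr_8k.trainImages.txt, Flickr_8k.devImages.txt, Flickr8k.token.txt) and the tab-separated caption format are assumptions about the standard Flickr8k_text layout, so adjust them to match your copy of the dataset:

import pandas as pd

# assumed location and file names within the Flickr8k_text folder
TEXT_DIR = '../Flickr8k_text/'

def read_image_list(fname):
    # each line of a split file holds one image file name
    with open(TEXT_DIR + fname, 'r') as f:
        return [line.strip() for line in f if line.strip()]

train_imgs = read_image_list('Flickr_8k.trainImages.txt')
dev_imgs = read_image_list('Flickr_8k.devImages.txt')
# combine the dev and train images, as discussed earlier
train_imgs = train_imgs + dev_imgs

# captions file: each line is '<image_name>#<caption_index>\t<caption text>'
captions = pd.read_csv(TEXT_DIR + 'Flickr8k.token.txt', sep='\t',
                       names=['image', 'caption'])
# strip the '#<caption_index>' suffix so captions can be joined on image name
captions['image'] = captions['image'].apply(lambda x: x.split('#')[0])

print(len(train_imgs), captions.shape)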