Training the Transformer model with VisualEncoder
Training the Transformer model can take hours, as we want to train for around 20 epochs, so it is best to put the training code into a file that can be run from the command line. Note that the model will be able to show some results even after 4 epochs of training. The training code is in the caption-training.py file.
At a high level, the following steps need to be performed before training can start. First, the CSV file with captions and image names is loaded, and the path to each image's file of extracted features is appended. The Subword Encoder used to tokenize the captions is also loaded. A tf.data.Dataset is then created from the encoded captions and image features so that they can be easily batched and fed into the model during training, as sketched below.
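The following is a minimal sketch of this data-loading step. The column names (`image_name`, `caption`), the `./features/` directory, the `.npy` feature-file format, the `captions_vocab` vocabulary file prefix, and the start/end token convention are all assumptions here; adjust them to match how the features and encoder were saved in the earlier steps.

```python
# Sketch of loading captions, the Subword Encoder, and building a
# tf.data.Dataset. File names and column names are assumptions.
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds

data = pd.read_csv("captions.csv")  # assumed columns: image_name, caption
# Append the path to the .npy file holding each image's extracted features
data["feature_path"] = "./features/" + data["image_name"] + ".npy"

# Load the Subword Encoder trained on the caption corpus
tokenizer = tfds.deprecated.text.SubwordTextEncoder.load_from_file("captions_vocab")

# Assumed convention: <start>/<end> ids placed just past the subword vocabulary
start, end = tokenizer.vocab_size, tokenizer.vocab_size + 1

def encode_caption(caption):
    return [start] + tokenizer.encode(caption.strip()) + [end]

captions = [encode_caption(c) for c in data["caption"]]
captions = tf.keras.preprocessing.sequence.pad_sequences(captions, padding="post")

def load_features(path, caption):
    # Read the pre-extracted image features from disk at iteration time
    features = np.load(path.decode("utf-8"))
    return features.astype(np.float32), caption

dataset = tf.data.Dataset.from_tensor_slices(
    (data["feature_path"].tolist(), captions)
)
dataset = dataset.map(
    lambda p, c: tf.numpy_function(load_features, [p, c], [tf.float32, tf.int32]),
    num_parallel_calls=tf.data.AUTOTUNE,
)
dataset = dataset.shuffle(1000).batch(64).prefetch(tf.data.AUTOTUNE)
```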
A loss function and an optimizer with a learning rate schedule are defined for training, and a custom training loop is used to train the Transformer model; sketches of both follow. Let's go over these steps in detail...
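Here is a sketch of the loss function and the optimizer. The learning rate schedule follows the warmup-then-decay formula from "Attention Is All You Need"; the `d_model=512` value and the Adam hyperparameters mirror that paper and are assumptions here, not values taken from the book's script.

```python
# Loss and optimizer sketch. The schedule ramps the learning rate up
# during warmup, then decays it proportionally to 1/sqrt(step).
import tensorflow as tf

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.minimum(arg1, arg2)

optimizer = tf.keras.optimizers.Adam(
    CustomSchedule(d_model=512), beta_1=0.9, beta_2=0.98, epsilon=1e-9
)

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none"
)

def loss_function(real, pred):
    # Mask out padding tokens (id 0) so they do not contribute to the loss
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)
```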
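Finally, a sketch of the custom training loop. The decoder is trained with teacher forcing: it receives the caption shifted right and is asked to predict the next token at each position. The call signature `transformer(img_features, tar_inp, training=True)` is an assumption, as is the model handling its own padding and look-ahead masks internally; the real caption-training.py may build masks explicitly.

```python
# Custom training loop sketch; `transformer`, `dataset`, `optimizer`,
# and `loss_function` come from the steps above.
@tf.function
def train_step(img_features, caption):
    tar_inp = caption[:, :-1]   # decoder input: caption shifted right
    tar_real = caption[:, 1:]   # target: the next token at each position
    with tf.GradientTape() as tape:
        # Assumed signature; masks are assumed to be built inside the model
        predictions, _ = transformer(img_features, tar_inp, training=True)
        loss = loss_function(tar_real, predictions)
    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
    return loss

EPOCHS = 20
for epoch in range(EPOCHS):
    total_loss, num_batches = 0.0, 0
    for img_features, caption in dataset:
        total_loss += train_step(img_features, caption)
        num_batches += 1
    print(f"Epoch {epoch + 1}: loss {float(total_loss) / num_batches:.4f}")
```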