Training the model
Now that the data pipeline and the model are defined, training the model is straightforward. First, let's define a few parameters:
n_vocab = 4000
batch_size = 96
train_fraction = 0.6
valid_fraction = 0.2
We use a vocabulary size of 4,000 and a batch size of 96. To speed up training, we'll use only 60% of the training data and 20% of the validation data; you could increase these fractions to get better results. Then we obtain the tokenizer trained on the full training dataset:
tokenizer = generate_tokenizer(
    train_captions_df, n_vocab=n_vocab
)
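If you need a refresher on what generate_tokenizer does, here is a minimal sketch of a function with similar behavior. It uses Keras's TextVectorization layer as a stand-in; the layer choice and the "caption" column name are assumptions for illustration, not necessarily the implementation used earlier in the chapter:
import pandas as pd
import tensorflow as tf

def generate_tokenizer_sketch(captions_df: pd.DataFrame, n_vocab: int):
    # Illustrative stand-in for generate_tokenizer; assumes the raw caption
    # strings are stored in a column named "caption".
    tokenizer = tf.keras.layers.TextVectorization(
        max_tokens=n_vocab,  # cap the vocabulary at n_vocab tokens
        standardize="lower_and_strip_punctuation",
        output_mode="int",  # map each token to an integer ID
    )
    # Learn the vocabulary from the full set of training captions
    tokenizer.adapt(captions_df["caption"].astype(str).values)
    return tokenizer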
Next, we define the BLEU metric. This is the same BLEU computation used in Chapter 9, Sequence-to-Sequence Learning – Neural Machine Translation, with some minor differences, so we will not repeat the discussion here.
bleu_metric = BLEUMetric(tokenizer=tokenizer)
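Conceptually, the metric decodes token IDs back into words and scores the generated captions against the reference captions with corpus-level BLEU. The following is a hedged sketch of such a wrapper, built on NLTK's corpus_bleu and assuming a tokenizer that exposes get_vocabulary(); the actual Chapter 9 implementation differs in its details:
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

class BLEUMetricSketch:
    # Illustrative stand-in for BLEUMetric; not the Chapter 9 code verbatim.
    def __init__(self, tokenizer):
        # Assumes a TextVectorization-style tokenizer with get_vocabulary()
        self.vocab = tokenizer.get_vocabulary()
        self.smoothing = SmoothingFunction().method1

    def _ids_to_tokens(self, ids):
        # Drop padding (ID 0) and map the remaining IDs back to word strings
        return [self.vocab[int(i)] for i in ids if int(i) != 0]

    def calculate_bleu(self, references, predictions):
        # references and predictions are batches of token-ID sequences
        refs = [[self._ids_to_tokens(r)] for r in references]
        hyps = [self._ids_to_tokens(p) for p in predictions]
        return corpus_bleu(refs, hyps, smoothing_function=self.smoothing)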
We sample the smaller validation set once, outside the training loop, so that the same captions are used for evaluation in every epoch:
# Assumes the full validation captions live in a DataFrame named valid_captions_df
sampled_validation_captions_df = valid_captions_df.sample(frac=valid_fraction)