Improving performance and state-of-the-art models
Before turning to the latest models, let's talk through some simple experiments you can try to improve performance. Recall our discussion of positional encodings for the inputs to the Encoder. Try adding or removing the positional encodings to see whether they help or hinder performance. In the previous chapter, we implemented the beam search algorithm for generating summaries. You can adapt that beam search code to this model and see an improvement in the results. Another avenue of exploration is ResNet50. We used a pre-trained network and did not fine-tune it further. It is possible to build an architecture in which ResNet50 is part of the model rather than a pre-processing step: image files are loaded in, and features are extracted by ResNet50 as part of the VisualEncoder. The ResNet50 layers can then be trained from the get-go, or only in the last few iterations. This idea is implemented in the resnet-finetuning.py file for you to try. Another line...
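The two fine-tuning options mentioned above (training the ResNet50 layers from the get-go, or only towards the end) can be sketched as a small scheduling helper. This is a minimal, framework-free illustration and not the contents of resnet-finetuning.py: the function name backbone_trainable, its parameters, and the choice to count epochs rather than individual iterations are all assumptions made for this example.

```python
def backbone_trainable(epoch, total_epochs, finetune_last_n=None):
    """Decide whether the ResNet50 backbone should be updated in this epoch.

    Hypothetical helper, not taken from resnet-finetuning.py.
    finetune_last_n=None -> train the backbone from the very first epoch.
    finetune_last_n=k    -> keep it frozen, then unfreeze for the last k epochs.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must be in [0, total_epochs)")
    if finetune_last_n is None:
        return True
    return epoch >= total_epochs - finetune_last_n


# Assumed 10-epoch run that unfreezes the backbone for the last 3 epochs.
schedule = [backbone_trainable(e, 10, finetune_last_n=3) for e in range(10)]
print(schedule)
```

In a Keras training loop, the returned flag would typically be assigned to the `trainable` attribute of the ResNet50 model at the start of each epoch. If you train with `compile()` and `fit()`, the model generally needs to be compiled again after `trainable` changes for the new setting to take effect.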