Generating captions for test images
Using metrics such as accuracy and BLEU, we have verified that our model performs well. But one of the most important tasks for a trained model is generating outputs for new data, so we will now learn how to use our model to generate actual captions. Let's first understand caption generation at a conceptual level. Producing the image representation is straightforward: we simply pass the image through the encoder (a minimal sketch follows). The tricky part is adapting the text decoder to generate captions. As you can imagine, the decoder must operate differently at inference than during training, because at inference time we don't have ground-truth caption tokens to feed to the model.
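As an illustration, here is a minimal sketch of how the image representation might be computed with a pretrained CNN. The encoder choice (InceptionV3 with average pooling) and the helper name `get_image_features` are assumptions for illustration, not necessarily the encoder our model uses.

```python
import tensorflow as tf

# Assumption: any pretrained CNN can serve as the encoder; here we use
# InceptionV3 without its classification head, pooled to a single vector.
image_encoder = tf.keras.applications.InceptionV3(
    include_top=False, pooling="avg"
)

def get_image_features(image_path):
    """Load an image, preprocess it, and return its feature vector."""
    data = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(data, channels=3)
    image = tf.image.resize(image, (299, 299))  # InceptionV3's expected input size
    image = tf.keras.applications.inception_v3.preprocess_input(image)
    return image_encoder(tf.expand_dims(image, 0))  # shape: (1, 2048)
```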
The way we predict with our model is to start with the image and a seed caption containing the single token [START]. We feed these two inputs to the model to generate the next token, append that token to the current caption, and feed the extended caption back in to predict the token after it. We repeat this process until the model produces an end-of-caption token ([END]) or the caption reaches a maximum length.
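Below is a minimal sketch of this greedy decoding loop. The interface here is an assumption rather than code from earlier: `model` is the trained captioner, mapping image features and a token-ID sequence to next-token logits at every position, and `tokenizer` is a Keras `Tokenizer` whose vocabulary contains the literal tokens `[START]` and `[END]`.

```python
import numpy as np
import tensorflow as tf

def generate_caption(model, image_features, tokenizer, max_length=30):
    """Greedily decode a caption: start from [START], append the argmax
    token at each step, and stop at [END] or max_length."""
    start_id = tokenizer.word_index["[START]"]
    end_id = tokenizer.word_index["[END]"]

    token_ids = [start_id]
    for _ in range(max_length):
        seq = tf.constant([token_ids])          # shape: (1, t)
        logits = model([image_features, seq])   # assumed: (1, t, vocab_size)
        next_id = int(np.argmax(logits[0, -1])) # greedy pick at the last position
        if next_id == end_id:
            break
        token_ids.append(next_id)

    # Map IDs (excluding [START]) back to words
    words = [tokenizer.index_word[i] for i in token_ids[1:]]
    return " ".join(words)
```

Note that this greedy loop always picks the single most likely token at each step; more sophisticated strategies such as beam search keep several candidate captions alive at once, but the iterative feed-back-in structure is the same.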