Implementing an image captioning network on COCO with attention
A great way to understand how an image captioning network generates its descriptions is to add an attention component to the architecture. This lets us see which parts of the photo the network was looking at as it generated each word.
In this recipe, we'll train an end-to-end image captioning system on the more challenging Common Objects in Context (COCO) dataset. We'll also equip our network with an attention mechanism to improve its performance and to help us understand its inner workings.
This is a long and advanced recipe, but don't panic! We'll go step by step. If you want to dive deeper into the theory that supports this implementation, take a look at the See also section.
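To give you a taste of what the attention component looks like before we start, here is a minimal sketch of an additive (Bahdanau-style) attention layer in TensorFlow/Keras. The class and parameter names here are illustrative, not necessarily the ones we'll use later in the recipe:

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    # Additive attention: scores each spatial location in the CNN
    # feature map against the decoder's current hidden state.
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, num_locations, features_dim)
        # hidden:   (batch, units)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) +
                                   self.W2(hidden_with_time_axis)))
        # One weight per location, summing to 1 across the image.
        attention_weights = tf.nn.softmax(scores, axis=1)
        context_vector = tf.reduce_sum(attention_weights * features,
                                       axis=1)
        return context_vector, attention_weights

It's precisely these attention weights that we can later visualize to see where the network looked while producing each word.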
Getting ready
Although we'll be using the COCO dataset, you don't need to do anything beforehand, because we'll download it as part of the recipe (however, you can read more about this seminal...
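For reference, downloading COCO programmatically can be as simple as the following sketch, which fetches the 2014 release from the official COCO URLs with tf.keras.utils.get_file. The filenames and cache location are illustrative; the exact ones used in the recipe may differ:

import os
import tensorflow as tf

# Captions and annotations (assuming the 2014 release; adjust the
# year if you work with a different split).
annotations_zip = tf.keras.utils.get_file(
    'captions.zip',
    origin='http://images.cocodataset.org/annotations/'
           'annotations_trainval2014.zip',
    cache_subdir=os.path.abspath('.'),
    extract=True)

# The training images are large (roughly 13 GB), so make sure you
# have enough disk space before running this.
images_zip = tf.keras.utils.get_file(
    'train2014.zip',
    origin='http://images.cocodataset.org/zips/train2014.zip',
    cache_subdir=os.path.abspath('.'),
    extract=True)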