Image captioning
Image captioning is the task of describing the contents of an image in a sentence. Captions can help in content-based image retrieval and visual search. We have already discussed how captions can improve the accessibility of websites by making it easier for screen readers to summarize the content of an image. A caption can be considered a summary of the image. Once we frame captioning as an image summarization problem, we can adapt the seq2seq model from the previous chapter to solve it. In text summarization, the input is a long-form article represented as a sequence, and the output is a short sequence summarizing its content. In image captioning, the output has the same format as in summarization: a short sequence of words. The input is the harder part, because it may not be obvious how to structure an image, which consists of pixels, as a sequence of embeddings to be fed into the Encoder.
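To make the question of structuring pixels as a sequence concrete, here is a minimal sketch of one possible approach, not necessarily the one this chapter adopts: cut the image into a grid of non-overlapping patches and treat each flattened patch as one element of the sequence. The function name image_to_patch_sequence and the patch_size parameter are hypothetical, chosen only for illustration, and the vectors produced are raw pixel values rather than learned embeddings.

```python
import numpy as np


def image_to_patch_sequence(image, patch_size):
    """Turn an (H, W, C) image into a (num_patches, dim) sequence.

    Hypothetical helper for illustration: each non-overlapping square
    patch is flattened into one vector of raw pixel values.
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Split both spatial axes: (rows, patch, cols, patch, C).
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Bring the two grid axes to the front: (rows, cols, patch, patch, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, ordered left to right, then top to bottom.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


# A toy 8x8 RGB "image" filled with increasing values.
image = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
sequence = image_to_patch_sequence(image, patch_size=4)
print(sequence.shape)  # (4, 48)
```

The result has the same layout as a sequence of token embeddings: one row per position and a fixed number of features per row. In a real model, each row would typically be replaced by a learned representation of that image region before being passed to the Encoder.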
Secondly, the summarization architecture used Bi-directional Long Short-Term Memory networks (BiLSTMs), with the underlying principle...