Approaches for image captioning and related problems
Several approaches have been suggested for captioning images. Intuitively, an image is converted to visual features, and text is generated from those features. The generated text is represented in the form of word embeddings. Some of the predominant approaches to generating text involve long short-term memory (LSTM) networks and attention. Let's begin with an older approach to generating text.
Using a conditional random field for linking image and text
Kulkarni et al., in the paper http://www.tamaraberg.com/papers/generation_cvpr11.pdf, proposed a method that finds the objects and attributes in an image and uses them to generate text with a conditional random field (CRF). CRFs are traditionally used for structured prediction, in which several interdependent outputs are predicted jointly, and here that idea is applied to text generation. The flow of generating text is shown here:
Figure illustrating the process of text generation using CRF [Reproduced from Kulkarni et al.]
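To make the idea of joint prediction concrete, here is a minimal sketch of CRF-style inference over two labels, an object and an attribute. It is not the model from Kulkarni et al.: all label names, scores, and the sentence template are hypothetical values chosen for illustration, and the best labeling is found by brute force rather than by a proper inference algorithm. The unary scores stand in for detector confidences, and the pairwise scores stand in for how plausibly an attribute and an object occur together.

```python
from itertools import product

# Hypothetical unary potentials: how strongly the image supports each label.
object_scores = {"dog": 1.2, "cat": 1.0}
attribute_scores = {"furry": 0.9, "wooden": 1.1}

# Hypothetical pairwise potentials: how compatible an attribute is with an object.
pair_scores = {
    ("furry", "dog"): 1.0,
    ("furry", "cat"): 0.8,
    ("wooden", "dog"): -1.5,
    ("wooden", "cat"): -1.5,
}


def map_labeling(object_scores, attribute_scores, pair_scores):
    """Return the (attribute, object) pair with the highest total score."""
    best, best_score = None, float("-inf")
    # Brute-force search over every joint assignment of the two labels.
    for attr, obj in product(attribute_scores, object_scores):
        score = (
            attribute_scores[attr]
            + object_scores[obj]
            + pair_scores.get((attr, obj), 0.0)
        )
        if score > best_score:
            best, best_score = (attr, obj), score
    return best, best_score


def to_sentence(attr, obj):
    """Fill a fixed, hypothetical template with the predicted labels."""
    return f"There is a {attr} {obj}."


labels, score = map_labeling(object_scores, attribute_scores, pair_scores)
caption = to_sentence(*labels)
print(caption)
```

Note that "wooden" has the highest attribute score on its own, but the pairwise term makes "furry dog" the best joint labeling. This is the benefit of scoring labels together rather than independently.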
The use of CRF has limitations in generating text in a coherent manner with proper...