With recent advances in both computer vision and NLP, more and more researchers are exploring applications at the intersection of the two fields.
One such application is image captioning, or im2text, which automatically generates a description for a given image. It requires the joint use of techniques from both computer vision and NLP. Given an image, the goal is to analyze its visual content and generate a realistic textual description of its major content or most salient aspect. For example, the most salient aspect may be a person in the picture.
To achieve this goal, a caption generation model needs at least two capabilities:
- Understand the visual cues
- Generate realistic natural language
Understanding the visual cues can be very task-specific; that is, the focus can differ from one scenario to another. This...