Let's take a look at the data we will use to build our model. To keep things simple, we will use the Flickr8K dataset, which consists of images obtained from Flickr, a popular image-sharing website. To download the dataset, fill in the request form from the Department of Computer Science, University of Illinois, at https://forms.illinois.edu/sec/1713398; you should then receive the download link by email.
For details on each image, refer to the dataset's website, http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html, which lists each image, its source, and its five text captions. In general, any sample image has several captions similar to the following:
You can clearly see the image and its corresponding captions. It is quite evident that all the captions try...