MS-COCO dataset for image captioning
Microsoft published the Common Objects in Context (COCO) dataset in 2014. All versions of the dataset are available from the COCO website. COCO is a large dataset used for object detection, segmentation, and captioning, among other annotation tasks. Our focus will be on the 2014 training and validation images, where five captions per image are available. There are roughly 83K images in the training set and 41K images in the validation set. The training and validation images, along with their captions, must be downloaded from the COCO website before use.
Large download warning: The training image dataset is approximately 13 GB, while the validation dataset is over 6 GB. The annotations for the image files, which include captions, are about 214 MB in size. Keep your internet bandwidth usage and any associated costs in mind before you download this dataset.
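Once the annotations are downloaded, the captions can be grouped by image. The following is a minimal sketch, not a definitive loader. It assumes the caption annotations are a JSON file with top-level "images" and "annotations" lists, and that the archive unpacks to a path such as annotations/captions_train2014.json; the path and the helper names group_captions and load_captions are illustrative assumptions. A tiny in-memory sample stands in for the real file so that the example runs without the download.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_captions(coco):
    """Map each image file name to the list of its captions.

    Assumes `coco` follows the caption annotation layout: an "images"
    list with "id" and "file_name", and an "annotations" list with
    "image_id" and "caption".
    """
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    captions = defaultdict(list)
    for ann in coco["annotations"]:
        file_name = id_to_file[ann["image_id"]]
        captions[file_name].append(ann["caption"].strip())
    return dict(captions)


def load_captions(path):
    """Read an annotation file from disk and group its captions.

    The path is an assumption, for example
    "annotations/captions_train2014.json".
    """
    with Path(path).open(encoding="utf-8") as f:
        return group_captions(json.load(f))


# Tiny stand-in for the real annotation file (hypothetical data).
sample = {
    "images": [
        {"id": 1, "file_name": "a.jpg"},
        {"id": 2, "file_name": "b.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "id": 10, "caption": "A dog runs. "},
        {"image_id": 2, "id": 11, "caption": "A red bus."},
        {"image_id": 1, "id": 12, "caption": "A dog on grass."},
    ],
}

grouped = group_captions(sample)
for file_name, caps in grouped.items():
    print(file_name, caps)
```

With the real file in place, calling load_captions on the training annotations should return one entry per captioned image, which makes it easy to check the number of captions per image before building a training pipeline.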
Google has also published a new Conceptual Captions dataset at