We build the video-captioning system by training the model on the MSVD dataset, a pre-captioned corpus of YouTube videos from Microsoft. The video clips can be downloaded from the following link: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/YouTubeClips.tar. The text captions for the videos are available at the following link: https://github.com/jazzsaxmafia/video_to_sequence/files/387979/video_corpus.csv.zip.
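The snippet below is a minimal sketch of how the archive and the caption corpus might be fetched and loaded. The local paths are illustrative, and the assumption that the caption CSV carries VideoID, Language, and Description columns is ours rather than something stated in the text; adjust both to your setup.

```python
import os
import tarfile
import urllib.request

import pandas as pd

# Illustrative local paths -- adjust to your own directory layout.
DATA_DIR = "data"
VIDEO_TAR = os.path.join(DATA_DIR, "YouTubeClips.tar")
CAPTION_ZIP = os.path.join(DATA_DIR, "video_corpus.csv.zip")

VIDEO_URL = ("http://www.cs.utexas.edu/users/ml/clamp/"
             "videoDescription/YouTubeClips.tar")
CAPTION_URL = ("https://github.com/jazzsaxmafia/video_to_sequence/"
               "files/387979/video_corpus.csv.zip")

os.makedirs(DATA_DIR, exist_ok=True)

# Download the video archive and the caption corpus if not already present.
for url, path in [(VIDEO_URL, VIDEO_TAR), (CAPTION_URL, CAPTION_ZIP)]:
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)

# Extract the individual video clips from the tar archive.
with tarfile.open(VIDEO_TAR) as tar:
    tar.extractall(DATA_DIR)

# pandas reads the zipped CSV directly; keep only the English captions
# (the 'Language' and 'Description' columns are assumed, not confirmed here).
captions = pd.read_csv(CAPTION_ZIP)
captions = captions[captions["Language"] == "English"]
print(captions[["VideoID", "Description"]].head())
```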
There are around 1,938 videos in the MSVD dataset, and we will use them to train the sequence-to-sequence video-captioning system. Note that we will build the model based on the sequence-to-sequence architecture illustrated in Figure 5.3; however, readers are encouraged to also train a model on the architecture presented in Figure 5.4 and see how it fares. A schematic sketch of such an encoder-decoder model follows.
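As a rough orientation before the detailed build, here is a generic encoder-decoder sketch in Keras: an encoder LSTM reads per-frame CNN features and a decoder LSTM, seeded with the encoder state, emits the caption one word at a time. This is only an illustration of the general pattern, not the exact architecture of Figure 5.3, and all dimensions below are assumed for the example.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Illustrative dimensions -- not taken from the text.
NUM_FRAMES, FEAT_DIM = 80, 4096        # per-frame features, e.g. from a CNN
VOCAB_SIZE, EMBED_DIM, HIDDEN = 10000, 256, 512
MAX_CAPTION_LEN = 20

# Encoder: an LSTM reads the sequence of frame features and
# summarizes the video in its final hidden and cell states.
frames = layers.Input(shape=(NUM_FRAMES, FEAT_DIM))
_, state_h, state_c = layers.LSTM(HIDDEN, return_state=True)(frames)

# Decoder: a second LSTM, initialized with the encoder state,
# consumes the (shifted) caption tokens during training.
caption_in = layers.Input(shape=(MAX_CAPTION_LEN,))
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_in)
decoded = layers.LSTM(HIDDEN, return_sequences=True)(
    embedded, initial_state=[state_h, state_c])

# Project each decoder step onto the vocabulary.
word_probs = layers.TimeDistributed(
    layers.Dense(VOCAB_SIZE, activation="softmax"))(decoded)

model = Model([frames, caption_in], word_probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

During training, the decoder input is the caption shifted right by one token (teacher forcing), and the target is the unshifted caption; at inference time, words are fed back in one at a time.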