In this section, we put all the pieces together to build the function for training the video-captioning model.
First, we create the word-vocabulary dictionary by combining the video captions from the training and test datasets. Once this is done, we invoke the build_model function to create the video-captioning network, which combines the two LSTMs. For each video clip, defined by a specific start and end time, there are multiple reference captions; within each batch, the caption used as the target for a given clip is selected at random from those available. The input text captions fed to LSTM 2 are adjusted so that the word at time step (N+1) is the start-of-sentence token <bos>, while the output text captions are adjusted so that their final label is the end-of-sentence token <eos>. The sum of the categorical cross-entropy loss over each of the time steps of the decoding stage forms the total loss that is minimized during training.
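The following is a minimal sketch of these preprocessing pieces, assuming plain Python/NumPy; the helper names build_vocab, preprocess_caption, pick_caption, and masked_sequence_loss are hypothetical illustrations rather than the exact functions used in the training script:

```python
# Sketch of vocabulary creation, caption preparation with <bos>/<eos>,
# random caption selection per clip, and the summed cross-entropy loss.
# These helper names are illustrative assumptions, not the book's code.
import random
import numpy as np

def build_vocab(train_captions, test_captions, min_count=1):
    """Build a word->index dictionary from the combined train and test captions."""
    counts = {}
    for caption in list(train_captions) + list(test_captions):
        for word in caption.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Reserve indices for the special tokens.
    word2idx = {'<pad>': 0, '<bos>': 1, '<eos>': 2, '<unk>': 3}
    for word, c in counts.items():
        if c >= min_count:
            word2idx.setdefault(word, len(word2idx))
    return word2idx

def pick_caption(captions_for_clip):
    """Each clip has several reference captions; pick one at random per batch."""
    return random.choice(captions_for_clip)

def preprocess_caption(caption, word2idx, max_len):
    """Prepend <bos> to the decoder input and append <eos> to the target labels."""
    ids = [word2idx.get(w, word2idx['<unk>']) for w in caption.lower().split()]
    ids = ids[:max_len - 1]
    decoder_input = [word2idx['<bos>']] + ids    # fed to LSTM 2 from step N+1 onward
    target = ids + [word2idx['<eos>']]           # label sequence ends with <eos>
    pad = max_len - len(target)
    mask = [1.0] * len(target) + [0.0] * pad     # ignore padded positions in the loss
    decoder_input += [word2idx['<pad>']] * pad
    target += [word2idx['<pad>']] * pad
    return decoder_input, target, mask

def masked_sequence_loss(logits, targets, mask):
    """Sum of the categorical cross-entropy over the decoding time steps.
    logits: (T, V) unnormalized scores, targets: (T,) word ids, mask: (T,)."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    nll = -np.log(probs[np.arange(len(targets)), targets] + 1e-9)
    return float((nll * np.asarray(mask)).sum())
```

In practice, the same masking idea carries over to whichever framework the model is built in: padded positions after <eos> contribute nothing to the loss, so only the real caption words drive the gradient updates.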