The sequence-to-sequence architecture is based on the paper Sequence to Sequence - Video to Text by Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. The paper is available at https://arxiv.org/pdf/1505.00487.pdf.
The following diagram (Figure 5.3) illustrates a sequence-to-sequence video-captioning neural network architecture based on this paper:
The sequence-to-sequence model passes the video frames through a pre-trained convolutional neural network, as before, and takes the output activations of the last fully connected layer as the features fed to the LSTMs that follow. If we denote the output activations of the last fully connected layer of the pre...
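The data flow described above (each frame goes through a CNN, and the resulting feature vectors are fed to an LSTM one time step at a time) can be sketched in plain Python. This is only an illustrative sketch under loud assumptions: `cnn_features` is a hypothetical stand-in for the last fully connected layer of a pre-trained network (here just a random linear map followed by ReLU), the weights are random rather than trained, and a single LSTM layer is used for the encoding pass instead of the full architecture from the paper. All names and sizes are made up for illustration.

```python
import math
import random


def cnn_features(frame, weights):
    """Hypothetical stand-in for the last fully connected layer of a
    pre-trained CNN: a linear map followed by ReLU."""
    return [max(0.0, sum(w * p for w, p in zip(row, frame))) for row in weights]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, params):
    """One LSTM time step. params maps a gate name to (W, b),
    where W acts on the concatenation of x and h."""
    z = x + h  # list concatenation: [x; h]

    def affine(name):
        W, b = params[name]
        return [sum(w * v for w, v in zip(row, z)) + bi for row, bi in zip(W, b)]

    i = [sigmoid(v) for v in affine("i")]    # input gate
    f = [sigmoid(v) for v in affine("f")]    # forget gate
    o = [sigmoid(v) for v in affine("o")]    # output gate
    g = [math.tanh(v) for v in affine("g")]  # candidate cell values
    c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
    h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
    return h_new, c_new


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def encode_video(frames, cnn_weights, lstm_params, hidden_size):
    """Feed the per-frame CNN features to the LSTM, one frame per time step."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    states = []
    for frame in frames:
        f_t = cnn_features(frame, cnn_weights)  # features for this frame
        h, c = lstm_step(f_t, h, c, lstm_params)
        states.append(h)
    return states


# Toy sizes, chosen only to keep the example small.
NUM_FRAMES, PIXELS, FEATURE_SIZE, HIDDEN_SIZE = 5, 12, 8, 4
rng = random.Random(0)
frames = [[rng.random() for _ in range(PIXELS)] for _ in range(NUM_FRAMES)]
cnn_weights = rand_matrix(rng, FEATURE_SIZE, PIXELS)
lstm_params = {
    gate: (rand_matrix(rng, HIDDEN_SIZE, FEATURE_SIZE + HIDDEN_SIZE),
           [0.0] * HIDDEN_SIZE)
    for gate in ("i", "f", "o", "g")
}
states = encode_video(frames, cnn_weights, lstm_params, HIDDEN_SIZE)
```

In practice the features would come from a real pre-trained network and the LSTM from a deep learning library; the sketch only makes the per-frame, per-time-step flow explicit. `states` holds one hidden state per input frame.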