In this section, the core model-building exercise is illustrated. We first define an embedding layer for the words in the vocabulary of the text captions, followed by the two LSTMs. The weights self.encode_W and self.encode_b are used to reduce the dimension of the features f_t coming from the convolutional neural network. For the first N time steps (self.video_lstm_step), LSTM 1 processes the input features f_t from the CNN, and its hidden state h_t (output1) is fed to LSTM 2. During this encoding phase, LSTM 2 doesn't receive any word w_(t-1) as input. For the second LSTM (LSTM 2), one of the other inputs at any time step t > N is the previous word w_(t-1), along with the output h_t from LSTM 1. The word embedding of w_(t-1), rather than the raw one-hot encoded vector, is what gets fed to LSTM 2.
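The following is a minimal sketch of how these pieces might be wired together, assuming TensorFlow 1.x. Only self.encode_W, self.encode_b, and self.video_lstm_step are names taken from the text; everything else (the class name, dim_image, dim_hidden, n_words, self.word_emb, self.caption_lstm_step, self.decode_W, self.decode_b) is an illustrative placeholder, not the author's actual code.

```python
import tensorflow as tf  # assumes TensorFlow 1.x

class VideoCaptioner:
    def __init__(self, n_words=5000, dim_image=4096, dim_hidden=512,
                 video_lstm_step=80, caption_lstm_step=20):
        self.dim_hidden = dim_hidden
        self.video_lstm_step = video_lstm_step      # N: encoding time steps
        self.caption_lstm_step = caption_lstm_step  # decoding time steps
        # Embedding layer for the words in the caption vocabulary
        self.word_emb = tf.Variable(
            tf.random_uniform([n_words, dim_hidden], -0.1, 0.1),
            name='word_emb')
        # Affine map that reduces the CNN feature f_t to the LSTM input size
        self.encode_W = tf.Variable(
            tf.random_uniform([dim_image, dim_hidden], -0.1, 0.1),
            name='encode_W')
        self.encode_b = tf.Variable(tf.zeros([dim_hidden]), name='encode_b')
        # Output projection from LSTM 2's hidden state to vocabulary logits
        # (hypothetical names, not from the text)
        self.decode_W = tf.Variable(
            tf.random_uniform([dim_hidden, n_words], -0.1, 0.1),
            name='decode_W')
        self.decode_b = tf.Variable(tf.zeros([n_words]), name='decode_b')
        # The two stacked LSTMs
        self.lstm1 = tf.nn.rnn_cell.BasicLSTMCell(dim_hidden)
        self.lstm2 = tf.nn.rnn_cell.BasicLSTMCell(dim_hidden)

    def encode(self, video_feats, batch_size):
        """Encoding phase: for t = 1..N, LSTM 1 consumes the reduced CNN
        feature f_t and LSTM 2 consumes output1, with zero padding standing
        in for the (absent) previous word."""
        state1 = self.lstm1.zero_state(batch_size, tf.float32)
        state2 = self.lstm2.zero_state(batch_size, tf.float32)
        padding = tf.zeros([batch_size, self.dim_hidden])
        for t in range(self.video_lstm_step):
            # Dimension reduction of f_t using encode_W and encode_b
            image_emb = tf.nn.xw_plus_b(video_feats[:, t, :],
                                        self.encode_W, self.encode_b)
            with tf.variable_scope('LSTM1', reuse=tf.AUTO_REUSE):
                output1, state1 = self.lstm1(image_emb, state1)
            # No word input yet: the word slot of LSTM 2 is zero padding
            with tf.variable_scope('LSTM2', reuse=tf.AUTO_REUSE):
                output2, state2 = self.lstm2(
                    tf.concat([padding, output1], axis=1), state2)
        return state1, state2
```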
From the (N+1) time step, we enter the decoding phase, where LSTM 2, in addition to the hidden state h_t produced by LSTM 1, receives the word embedding of the previous word w_(t-1) as input and learns to predict the next word of the caption.
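Continuing the sketch above, the decoding loop might look as follows. Feeding zero padding to LSTM 1 in place of video features is an assumption based on the standard sequence-to-sequence video-captioning layout, and caption_ids (ground-truth word indices used for teacher forcing) is a hypothetical input, not something named in the text.

```python
    def decode(self, state1, state2, caption_ids, batch_size):
        """Decoding phase (t > N): LSTM 1 receives zero padding instead of
        video features (an assumption; standard in this architecture), while
        LSTM 2 receives the embedding of the previous word w_(t-1)
        concatenated with LSTM 1's output, and predicts the next word."""
        padding = tf.zeros([batch_size, self.dim_hidden])
        logits = []
        for t in range(self.caption_lstm_step):
            # Embedding lookup for w_(t-1): the embedded word, not the raw
            # one-hot vector, is what LSTM 2 receives
            prev_word_emb = tf.nn.embedding_lookup(self.word_emb,
                                                   caption_ids[:, t])
            with tf.variable_scope('LSTM1', reuse=tf.AUTO_REUSE):
                output1, state1 = self.lstm1(padding, state1)
            with tf.variable_scope('LSTM2', reuse=tf.AUTO_REUSE):
                output2, state2 = self.lstm2(
                    tf.concat([prev_word_emb, output1], axis=1), state2)
            # Project LSTM 2's output to vocabulary logits for word t
            logits.append(tf.nn.xw_plus_b(output2, self.decode_W,
                                          self.decode_b))
        return logits
```

At training time this loop would use the ground-truth previous word (teacher forcing); at inference time, the argmax of the previous step's logits would be fed back in as w_(t-1) instead.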