Now, let's train the model. The first thing we need to do is to extract the features stored in the respective .npy files and then pass those features through the CNN encoder.
The encoder output, hidden state (initialized to 0) and the decoder input (which is the start token) are passed to the decoder. The decoder returns the predictions and the decoder hidden state.
The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss. While training, we use the teacher forcing technique to decide the next input to the decoder.
Teacher forcing is the technique where the target word is passed as the next input to the decoder. This technique helps to learn the correct sequence or correct statistical properties for the sequence, quickly.
The final step is to calculate the gradient and apply it to the optimizer...