Training the NMT
Now that we have defined the NMT architecture and preprocessed training data, it is quite straightforward to train the model. Here we will define and illustrate (see Figure 10.10) the exact process used for training:
- Preprocess the source sentence xs and its corresponding target sentence as explained previously
- Feed xs into the encoder and calculate v conditioned on xs
- Initialize the decoder with v
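- Predict the target sentence corresponding to the input sentence xs from the decoder, where the mth prediction, out of the target vocabulary V, is calculated as follows:
$$w_T^m = \arg\max_{w \in V} P(w \mid v, w_T^1, \ldots, w_T^{m-1})$$
Here, $w_T^m$ denotes the best target word for the mth position.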
- Calculate the loss: the categorical cross-entropy between the predicted word distribution and the actual word at the mth position
- Optimize the encoder, the decoder, and the softmax layer jointly with respect to the loss
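The prediction and loss steps in the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the chapter's implementation: the toy vocabulary, the decoder scores, and the helper names `softmax`, `predict_word`, and `cross_entropy` are all hypothetical. In a real model the scores would come from the decoder followed by the softmax layer, and an optimizer would use the loss to update the encoder, the decoder, and the softmax layer.

```python
import math


def softmax(logits):
    """Turn raw decoder scores into a probability distribution."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_word(logits, vocab):
    """Return the most probable word and the full distribution over vocab."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs


def cross_entropy(probs, vocab, actual_word):
    """Categorical cross-entropy against a one-hot target word."""
    return -math.log(probs[vocab.index(actual_word)])


# Hypothetical target vocabulary and decoder scores for two positions.
vocab = ["<unk>", "ich", "bin", "hier"]
decoder_logits = [[0.1, 2.0, 0.3, 0.2], [0.0, 0.5, 1.5, 1.4]]
actual_words = ["ich", "hier"]

predictions, losses = [], []
for logits, actual in zip(decoder_logits, actual_words):
    word, probs = predict_word(logits, vocab)
    predictions.append(word)
    losses.append(cross_entropy(probs, vocab, actual))

# Average the per-position losses to get the loss for the sentence.
mean_loss = sum(losses) / len(losses)
print(predictions, mean_loss)
```

In this toy run the first position is predicted correctly and the second is not, so the second position contributes the larger loss. That larger loss is what would drive the parameter update in the final step.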