Training the NMT
Now that we have defined the NMT architecture and preprocessed training data, it is quite straightforward to train the model. Here we will define and illustrate (see Figure 10.10) the exact process used for training:
- Preprocess the source sentence $x_s$ and its target sentence as explained previously
- Feed $x_s$ into the encoder and calculate $v$ conditioned on $x_s$
- Initialize the decoder with $v$
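- Predict the target sentence corresponding to the input sentence $x_s$ from the decoder, where the $m$th prediction, out of the target vocabulary $V$, is calculated as follows:

  $$w_T^m = \arg\max_{w \in V} P(w \mid v, w_T^1, \ldots, w_T^{m-1})$$

  Here, $w_T^m$ denotes the best target word for the $m$th position.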
- Calculate the loss: categorical cross-entropy between the predicted word, $w_T^m$, and the actual word at the $m$th position of the target sentence
- Optimize the encoder, the decoder, and the softmax layer with respect to the loss
Figure 10.10: The training procedure for the NMT
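The prediction and loss steps of this procedure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the chapter's implementation: the decoder scores in `step_logits`, the four-word vocabulary, and the helper names `softmax`, `predict_word`, and `sentence_loss` are all hypothetical, and the encoder, the decoder, and the optimization step are left out entirely.

```python
import math


def softmax(logits):
    """Turn raw decoder scores into a probability distribution over the vocabulary."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_word(probs):
    """Return the index of the most probable word (the argmax over the vocabulary)."""
    return max(range(len(probs)), key=probs.__getitem__)


def sentence_loss(step_logits, target_ids):
    """Mean categorical cross-entropy over the target positions."""
    losses = []
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        # Cross-entropy against a one-hot target reduces to -log p(actual word).
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Hypothetical toy data: a 4-word target vocabulary and a 3-word target sentence.
vocab = ["<s>", "i", "am", "</s>"]
step_logits = [          # one row of made-up decoder scores per target position
    [0.1, 2.0, 0.3, 0.0],
    [0.0, 0.5, 1.5, 0.2],
    [0.3, 0.1, 0.2, 0.4],
]
target_ids = [1, 2, 3]   # indices of the actual words: "i", "am", "</s>"

predicted = [vocab[predict_word(softmax(row))] for row in step_logits]
loss = sentence_loss(step_logits, target_ids)
print(predicted, round(loss, 4))
```

In a real model the scores would come from the decoder at each time step, and a deep learning framework would backpropagate this loss through the softmax layer, the decoder, and the encoder together.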