We can train the transformer network by minimizing the loss function. Okay, but what loss function should we use? We learned that the decoder predicts the probability distribution over the vocabulary and we select the word that has the highest probability as output. So, we have to minimize the difference between the predicted probability distribution and the actual probability distribution. First, how can we find the difference between the two distributions? We can use cross-entropy for that. Thus, we can define our loss function as a cross-entropy loss and try to minimize the difference between the predicted and actual probability distribution. We train the network by minimizing the loss function and we use Adam as an optimizer.
One additional point we need to note down is that to prevent overfitting, we apply dropout to the output of each sublayer and we also apply dropout to the sum of the embeddings and the positional encoding.
Thus, in this chapter, we learned...