Attention mechanism
In the previous section, we saw how the context or thought vector from the last time step of the encoder is fed into the decoder as its initial hidden state. As this context flows through the decoder's time steps, it is repeatedly combined with the decoder's own outputs, and its signal grows progressively weaker. As a result, the context has little effect on the later time steps of the decoder.
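To make the problem concrete, here is a minimal NumPy sketch of this hand-off. The dimensions, random weights, and the rnn_step helper are illustrative assumptions, not the network from the previous section; the point is only that the context vector seeds the decoder's first step and is then diluted at every subsequent step.

```python
import numpy as np

def rnn_step(x, h, W_xh, W_hh):
    """One vanilla RNN step: combine input x with the previous hidden state h."""
    return np.tanh(x @ W_xh + h @ W_hh)

# Hypothetical dimensions and weights, for illustration only.
hidden_dim, embed_dim = 8, 8
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

# Encoder: the final hidden state becomes the single "context" (thought) vector.
src = rng.normal(size=(4, embed_dim))   # 4 source tokens (stand-in embeddings)
h = np.zeros(hidden_dim)
for x in src:
    h = rnn_step(x, h, W_xh, W_hh)
context = h

# Decoder: the context only initializes the first step; each later step
# mixes it with new inputs, so its influence decays over time.
h = context
for t in range(6):                       # 6 target time steps
    y_prev = rng.normal(size=embed_dim)  # stand-in for the embedded previous output
    h = rnn_step(y_prev, h, W_xh, W_hh)
```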
In addition, certain sections of the decoder output may depend more heavily on certain sections of the input. For example, consider the input “thank you very much” and the corresponding output “merci beaucoup” for an English-to-French translation network such as the one we looked at in the previous section. Here, the English phrases “thank you” and “very much” correspond to the French “merci” and “beaucoup,” respectively. This alignment information is also not conveyed adequately through the single fixed-length context vector.
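A sketch of the idea behind attention, assuming a simple dot-product scoring scheme (one of several common choices), is shown below. Each decoder step scores every encoder output against the current decoder state and builds its own context vector as a weighted sum; the dimensions and random states here are hypothetical stand-ins for trained values.

```python
import numpy as np

def dot_product_attention(dec_state, enc_outputs):
    """Score each encoder output against the decoder state, then build
    a per-step context vector as the weighted sum of encoder outputs."""
    scores = enc_outputs @ dec_state          # one score per source step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over source steps
    context = weights @ enc_outputs           # weighted sum of encoder outputs
    return context, weights

# Hypothetical states for "thank you very much" -> "merci beaucoup".
rng = np.random.default_rng(1)
enc_outputs = rng.normal(size=(4, 8))  # one vector per English token
dec_state = rng.normal(size=8)         # decoder state when emitting "merci"

context, weights = dot_product_attention(dec_state, enc_outputs)
print(weights)
```

In a trained network, the weights at the step that emits “merci” would ideally concentrate on the encoder outputs for “thank” and “you,” giving later decoder steps direct access to the relevant parts of the input instead of a single diluted context.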