Attention mechanism
In the previous section we saw how the context, or thought vector, from the last time step of the encoder is fed into the decoder as its initial hidden state. As this context flows through the decoder's time steps, it is combined with the decoder output at each step, and its signal grows progressively weaker. As a result, the context has little influence over the later time steps of the decoder. A minimal sketch of this wiring follows.
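To make the bottleneck concrete, here is a hedged sketch of a plain seq2seq network in Keras, with toy, assumed layer sizes (vocab_size, embed_dim, hidden_dim are illustrative, not the values from the previous section). Note that the decoder sees the source sentence only through the encoder's single final state:

```python
import tensorflow as tf

vocab_size, embed_dim, hidden_dim = 5000, 64, 128  # assumed toy sizes

# Encoder: consume the source sequence, keep only the final hidden state.
enc_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(enc_inputs)
_, enc_state = tf.keras.layers.GRU(hidden_dim, return_state=True)(enc_emb)

# Decoder: initialized with that one context vector. Each GRU update mixes
# in new decoder inputs, so the context signal is progressively diluted.
dec_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(dec_inputs)
dec_out = tf.keras.layers.GRU(hidden_dim, return_sequences=True)(
    dec_emb, initial_state=enc_state)
logits = tf.keras.layers.Dense(vocab_size)(dec_out)

model = tf.keras.Model([enc_inputs, dec_inputs], logits)
```

Everything the decoder knows about the input must squeeze through `enc_state`, a single fixed-size vector, no matter how long the source sentence is.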
In addition, certain parts of the decoder output may depend more heavily on certain parts of the input. For example, consider the input "thank you very much" and the corresponding output "merci beaucoup" for an English-to-French translation network such as the one we looked at in the previous section. Here the English phrases "thank you" and "very much" correspond to the French words "merci" and "beaucoup" respectively. This alignment information is also not conveyed adequately through the single context vector.
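The core idea attention introduces is to address exactly this: instead of one fixed context, score every encoder output against the current decoder state and build a fresh, weighted context at each decoding step. The sketch below uses simple dot-product scoring with assumed toy shapes; it illustrates the mechanism rather than the specific attention variant developed later in this section:

```python
import tensorflow as tf

batch, src_len, hidden_dim = 2, 4, 128  # assumed toy sizes
enc_outputs = tf.random.normal((batch, src_len, hidden_dim))  # all encoder steps
dec_state = tf.random.normal((batch, hidden_dim))             # one decoder step

# Alignment scores: how relevant is each source position to this step?
scores = tf.einsum('bh,bsh->bs', dec_state, enc_outputs)   # (batch, src_len)
weights = tf.nn.softmax(scores, axis=-1)                   # attention weights

# Per-step context: a weighted sum of encoder outputs, so the step that
# emits "beaucoup" can weight the positions covering "very much" heavily.
context = tf.einsum('bs,bsh->bh', weights, enc_outputs)    # (batch, hidden_dim)
```

Because the weights are recomputed at every decoder time step, later outputs get just as strong a view of the input as earlier ones, and each output can align itself with the source positions it actually depends on.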