Attention
The LSTM-based architecture we used to prepare our first language model for text generation had one major limitation. The RNN layer (whether LSTM, GRU, or a vanilla RNN) takes a context window of a fixed size as input and encodes all of it into a single vector. This bottleneck vector has to capture all of the relevant information before the decoding stage can use it to start generating the next token.
Attention is one of the most powerful concepts in the deep learning space, and it truly changed the game. The core idea behind the attention mechanism is to make use of all the interim hidden states of the RNN and decide which ones to focus on before they are used by the decoding stage. A more formal way of presenting attention is:
Given a set of value vectors (all the hidden states of the RNN) and a query vector (this could be the decoder state), attention is a technique for computing a weighted sum of the values, dependent on the query.
The weighted sum acts as a selective summary of the information contained in the values, with the query determining which values to focus on.
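To make this concrete, here is a minimal sketch of the idea using plain NumPy and simple dot-product scoring. The function name `attention` and the toy shapes are illustrative, not part of any particular library: the query is scored against every hidden state, the scores are normalized with a softmax, and the resulting weights produce the context vector as a weighted sum.

```python
import numpy as np

def attention(query, values):
    """Compute a weighted sum of `values` given a `query`.

    query:  shape (d,)    -- e.g. the current decoder state
    values: shape (T, d)  -- e.g. all interim hidden states of the RNN
    Returns the context vector (d,) and the attention weights (T,).
    """
    # 1. Score each hidden state against the query (dot-product scoring here).
    scores = values @ query                   # shape (T,)
    # 2. Normalize the scores into a probability distribution (softmax).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # shape (T,)
    # 3. Weighted sum of the values gives the context vector.
    context = weights @ values                # shape (d,)
    return context, weights

# Toy usage: 4 hidden states of size 8 and a single query vector.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(4, 8))
decoder_state = rng.normal(size=(8,))
context, weights = attention(decoder_state, hidden_states)
print(weights)        # how strongly each time step is attended to
print(context.shape)  # (8,)
```

Because every hidden state contributes to the context vector in proportion to its weight, the decoder is no longer limited to a single bottleneck vector; it can draw on whichever parts of the input are most relevant for the current step.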