We spent the better part of this chapter touting the advantages of the attention mechanism. But so far we have only used attention in the context of RNNs; in that sense, it works as an addition on top of the core recurrent nature of these models. Since attention works so well, is there a way to use it on its own, without the RNN part? It turns out that there is. The paper Attention Is All You Need (https://arxiv.org/abs/1706.03762) introduces a new encoder-decoder architecture called the transformer, which relies solely on the attention mechanism. First, we'll focus our attention on the transformer attention (pun intended) mechanism.
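To make that idea concrete before we dive in, here is a minimal NumPy sketch of the scaled dot-product attention defined in the paper, softmax(QK^T / sqrt(d_k))V. The function name and the toy input shapes are our own choices for illustration, not part of the paper; the point is simply that every position attends to every other position directly, with no recurrence involved.

import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(q @ k.T / sqrt(d_k)) @ v for 2D inputs (sequence, features)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                    # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ v                                  # weighted sum of the values

# Toy self-attention: a sequence of 4 tokens with 8-dimensional embeddings attends to itself,
# so the queries, keys, and values all come from the same input.
x = np.random.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)

Note that nothing here depends on processing the sequence step by step, which is exactly what lets the transformer drop the RNN.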
Understanding transformers
The transformer attention
Before focusing on the entire model, let...