Building transformers with attention
We’ve spent the better part of this chapter touting the advantages of the attention mechanism. It’s time to reveal the full transformer architecture, which, unlike RNNs, relies solely on attention (Attention Is All You Need, https://arxiv.org/abs/1706.03762). The following diagram shows two of the most popular transformer flavors, post-LN and pre-LN (post-layer-normalization and pre-layer-normalization, referring to where layer normalization sits relative to each sub-layer):
Figure 7.9 – Left: the original (post-normalization, post-LN) transformer; right: pre-normalization (pre-LN) transformer (inspired by https://arxiv.org/abs/1706.03762)
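To make the post-LN/pre-LN distinction concrete, here is a minimal sketch of a single encoder block in both flavors, written in PyTorch. The class name `Block` and the default sizes are illustrative choices, not part of the original paper; the point is only where `LayerNorm` is applied relative to the residual connection:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer encoder block, in post-LN or pre-LN flavor."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256, pre_ln=True):
        super().__init__()
        self.pre_ln = pre_ln
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        if self.pre_ln:
            # pre-LN: normalize *before* each sub-layer;
            # the residual path carries the raw input through
            h = self.ln1(x)
            x = x + self.attn(h, h, h)[0]
            x = x + self.ff(self.ln2(x))
        else:
            # post-LN (the original paper's layout): normalize
            # *after* the residual addition
            x = self.ln1(x + self.attn(x, x, x)[0])
            x = self.ln2(x + self.ff(x))
        return x

x = torch.randn(2, 10, 64)          # (batch, sequence, embedding)
print(Block(pre_ln=True)(x).shape)  # torch.Size([2, 10, 64])
```

Both variants preserve the input shape, so blocks of either flavor can be stacked; pre-LN is often preferred in practice because it tends to train more stably at depth.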
It looks scary, but fret not: it’s easier than it seems. In this section, we’ll discuss the transformer in the context of the seq2seq task, which we defined in the Introducing seq2seq models section. That is, it takes a sequence of tokens as input and outputs another, different token sequence...