Transformers’ architectures
In this section, we provide a high-level overview of both the most important architectures used by transformers and the different ways of computing attention.
Categories of transformers
Here, we classify transformers into different categories. The following subsections introduce the most common ones.
Decoder or autoregressive
A typical example is a GPT (Generative Pre-trained Transformer) model, which you can learn more about in the GPT-2 and GPT-3 sections later in this chapter, or at https://openai.com/blog/language-unsupervised. Autoregressive models use only the decoder of the original transformer model. Their attention heads can only see tokens that come before the current position in the text, never after it: a causal mask applied over the full sentence hides the future tokens. During pretraining, autoregressive models learn to guess the next token after observing all the previous ones. Typically, autoregressive models are used for Natural Language Generation tasks, such as text generation.
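To make the causal masking concrete, here is a minimal sketch of single-head self-attention in PyTorch. It is an illustration, not a real decoder block: it omits the learned query, key, and value projections and simply applies an upper-triangular mask so that each position can attend only to itself and to earlier positions. The function name and shapes are our own assumptions for this example.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """Single-head causal self-attention over a sequence of embeddings.

    x: tensor of shape (seq_len, d_model). For illustration, the queries,
    keys, and values are the embeddings themselves (no learned projections).
    """
    seq_len, d_model = x.shape
    # Scaled dot-product similarity between every pair of positions
    scores = x @ x.T / d_model ** 0.5                      # (seq_len, seq_len)
    # Causal mask: position i may only attend to positions j <= i,
    # so everything above the diagonal is hidden
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                    # rows sum to 1 over visible positions
    return weights @ x                                     # weighted sum of value vectors

# Each output position depends only on itself and earlier positions,
# which is what lets the model be trained to predict the next token.
out = causal_self_attention(torch.randn(5, 16))
```

Because the mask sets the scores of future positions to negative infinity before the softmax, their attention weights become exactly zero, which is how the decoder is prevented from "cheating" by looking ahead during next-token pretraining.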