LLM architecture
In Chapter 7, we introduced the multi-head attention (MHA) mechanism and the three major transformer variants: encoder-decoder, encoder-only, and decoder-only (we used BERT and GPT as prototypical encoder-only and decoder-only models, respectively). In this section, we'll discuss the various components of the LLM architecture. Let's start by focusing our attention (yes, it's the same old joke) on the attention mechanism.
LLM attention variants
The attention mechanism we have discussed so far is known as global attention. The following diagram displays the connectivity matrix of a bidirectional global self-attention mechanism with a context window of size n=8:
Figure 8.1 – Global self-attention with a context window of size n=8
The rows and columns both represent the full input token sequence, $[t_1 \dots t_n]$ (here, n=8). The dotted colored diagonal cells represent the current input token (query), $t_i$. The uninterrupted colored cells of each column represent all tokens...
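To make the connectivity matrix concrete, here is a minimal PyTorch sketch (our own illustration, not taken from any specific library) that builds the n=8 connectivity matrix for bidirectional global self-attention, where every query attends to every token in the window, and contrasts it with the causal mask of a decoder-only model, where each query attends only to itself and the tokens that precede it:

import torch

n = 8  # context window size

# Bidirectional global self-attention: every query token attends to every
# token in the window, so the connectivity matrix is all ones (no masking).
bidirectional_mask = torch.ones(n, n, dtype=torch.bool)

# Unidirectional (causal) global attention, as used in decoder-only models:
# the query at position i attends only to positions j <= i (lower triangle).
causal_mask = torch.tril(torch.ones(n, n, dtype=torch.bool))

# When computing attention, masked-out positions are set to -inf before the
# softmax, so they receive zero attention weight.
scores = torch.randn(n, n)  # stand-in for the scaled QK^T attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)  # each row sums to 1

Printing bidirectional_mask as integers reproduces the fully connected pattern of Figure 8.1, while causal_mask shows the lower-triangular pattern we will contrast it with when discussing decoder-only attention.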