Delving into self-attention
Let’s consider again the ancient art of hieroglyphs, where symbols were chosen deliberately to convey complex messages. Self-attention operates in a similar manner: it determines which parts of a sequence are most relevant and should be emphasized.
Figure 11.6 illustrates the elegance of integrating self-attention into a sequential model. Think of the bottom layer, a bidirectional RNN, as the foundational stones of a pyramid: it produces the hidden states from which we compute the context vector (c2), a summary of the sequence, much as a hieroglyph summarizes an event.
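To make this concrete, here is a minimal sketch, assuming PyTorch and placeholder layer sizes, of a bidirectional RNN producing one hidden state per time step; the dimensions and the random input are illustrative choices, not details taken from Figure 11.6.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a sequence of T = 5 time steps, each embedded
# into an 8-dimensional vector, with a hidden size of 4 per direction.
T, embed_dim, hidden_dim = 5, 8, 4

# A bidirectional RNN layer: it reads the sequence forward and backward.
rnn = nn.RNN(input_size=embed_dim, hidden_size=hidden_dim,
             batch_first=True, bidirectional=True)

# One placeholder sequence of T embedded tokens (batch size 1).
x = torch.randn(1, T, embed_dim)

# Each time step yields the concatenated forward and backward hidden states.
hidden_states, _ = rnn(x)
print(hidden_states.shape)   # torch.Size([1, 5, 8]) -> (batch, T, 2 * hidden_dim)
```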
Each step, or word, in the sequence is assigned its own attention weight, symbolized as α. These weights determine how strongly each element contributes to the context vector, emphasizing certain elements over others.
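Continuing that sketch, the snippet below shows how such weights, obtained here by softmax-normalizing placeholder scores rather than a learned scoring function, combine a set of hidden states into a single context vector; the shapes are assumed for illustration.

```python
import torch

# Hypothetical hidden states for T = 5 time steps (for instance, the output
# of the bidirectional RNN above, squeezed to shape (T, d)); the values here
# are random placeholders.
T, d = 5, 8
hidden_states = torch.randn(T, d)            # h_1, ..., h_T

# Unnormalized attention scores, one per time step; in a real model these
# would come from a learned scoring function.
scores = torch.randn(T)

# Softmax turns the scores into attention weights alpha:
# positive values that sum to 1 across the sequence.
alphas = torch.softmax(scores, dim=0)

# The context vector is the attention-weighted sum of the hidden states.
context_vector = (alphas.unsqueeze(1) * hidden_states).sum(dim=0)

print("attention weights:", alphas)
print("context vector shape:", context_vector.shape)   # torch.Size([8])
```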
Imagine a scenario in which the input $x^{(k)}$ represents a distinct sentence, indexed by $k$, of length $T$. This can be articulated mathematically as:

$$x^{(k)} = \left(x^{(k)}_{1},\, x^{(k)}_{2},\, \ldots,\, x^{(k)}_{T}\right)$$

Here, every element $x^{(k)}_{t}$ represents a word or token of sentence $k$.
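As a small illustration of this notation, the following sketch encodes a hypothetical sentence k as a sequence of T tokens and maps each token to an integer index; the sentence, the whitespace tokenizer, and the toy vocabulary are all assumptions made for the example.

```python
# A hypothetical sentence k, used only to illustrate the notation.
sentence_k = "the scribe carved the symbols carefully"

# Split into tokens x_1, ..., x_T (simple whitespace tokenization).
tokens = sentence_k.split()
T = len(tokens)

# A toy vocabulary mapping each distinct token to an integer index,
# which is how the sequence is typically fed into a model.
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
x_k = [vocab[word] for word in tokens]

print("tokens:", tokens)
print("T =", T)          # sequence length
print("x^(k):", x_k)     # integer-encoded sequence
```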