Why GPUs are so special
A clue to GPU-driven design emerges in The architecture of multi-head attention section of Chapter 2, Getting Started with the Architecture of the Transformer Model.
Attention is defined as “Scaled Dot-Product Attention,” which is represented by the following equation into which we plug Q, K, and V:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
We can now conclude the following:
- Attention heads are designed for parallel computing
- Attention heads are based on matmul (matrix multiplication), as shown in the sketch after this list
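The following is a minimal sketch, not the book's code, of scaled dot-product attention written with explicit matmul calls. It assumes PyTorch and illustrative tensor shapes (batch, heads, tokens, d_k); the point is that the heavy lifting is two matrix multiplications per head, all of which a GPU can run in parallel:

```python
# Sketch of scaled dot-product attention using plain matmul operations.
# Shapes and the scaled_dot_product_attention helper name are illustrative assumptions.
import math

import torch

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V with two matmul calls."""
    d_k = Q.size(-1)                                 # dimension of the key vectors
    scores = torch.matmul(Q, K.transpose(-2, -1))    # QK^T: first matrix multiplication
    scores = scores / math.sqrt(d_k)                 # scale by sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)          # attention weights
    return torch.matmul(weights, V)                  # weighted sum of values: second matmul

# Illustrative sizes: batch of 2 sequences, 8 heads, 10 tokens, d_k = 64.
device = "cuda" if torch.cuda.is_available() else "cpu"  # use a GPU when one is present
Q = torch.randn(2, 8, 10, 64, device=device)
K = torch.randn(2, 8, 10, 64, device=device)
V = torch.randn(2, 8, 10, 64, device=device)

output = scaled_dot_product_attention(Q, K, V)
print(output.shape)  # torch.Size([2, 8, 10, 64])
```

Because the eight heads in this example share no state during the computation, the GPU can process them, and every row of each matmul, simultaneously rather than one after the other.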