Attention heads
A self-attention sublayer is divided into n independent, identical units called heads that run in parallel. For example, the original Transformer contains eight heads.
Figure I.3 represents heads as processors to show that transformers’ industrialized structure fits hardware design:
Figure I.3: A self-attention sublayer contains heads
Note that the attention heads are represented by microprocessors in Figure I.3 to stress the parallel processing power of transformer architectures.
Transformer architectures fit both NLP and hardware-optimization requirements.
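The following is a minimal NumPy sketch of this idea, not the book's code: it assumes d_model = 512 and eight heads of dimension d_k = 64, as in the original Transformer, and uses random matrices in place of learned projections. Each head computes scaled dot-product attention independently, and the heads' outputs are concatenated back to the model dimension:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, n_heads=8):
    """Split the model dimension into n_heads independent heads,
    run scaled dot-product attention in each head, then concatenate."""
    seq_len, d_model = x.shape
    d_k = d_model // n_heads          # 512 // 8 = 64 in the original Transformer

    rng = np.random.default_rng(0)    # random weights stand in for learned ones
    head_outputs = []
    for _ in range(n_heads):          # each head has its own Q, K, V projections
        W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        scores = softmax(Q @ K.T / np.sqrt(d_k))   # (seq_len, seq_len)
        head_outputs.append(scores @ V)            # (seq_len, d_k)

    # Concatenating the heads restores the model dimension: 8 * 64 = 512
    return np.concatenate(head_outputs, axis=-1)

x = np.random.rand(10, 512)                   # 10 tokens, d_model = 512
print(multi_head_self_attention(x).shape)     # (10, 512)
```

Because the heads share no weights and no intermediate results, the loop over heads can be executed in parallel, which is exactly the property that maps well onto modern hardware.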