The Architecture and Scale of Transformers
A hint about hardware-driven design appears in the "The architecture of multi-head attention" section of Chapter 2, Getting Started with the Architecture of the Transformer Model:
“However, we would only get one point of view at a time by analyzing the sequence with one d_model block. Furthermore, it would take quite some calculation time to find other perspectives.
A better way is to divide the d_model = 512 dimensions of each word x_n of x (all the words of a sequence) into 8 d_k = 64 dimensions.
We then can run the 8 “heads” in parallel to speed up the training and obtain 8 different representation subspaces of how each word relates to another:
Figure II.1: Multi-head representations
You can see that there are now 8 heads running in parallel.”
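To make the dimension split concrete, the following minimal NumPy sketch (not the book's code) divides a toy sequence's d_model = 512 features into 8 heads of d_k = 64, runs scaled dot-product self-attention in each subspace, and concatenates the results. The sequence length and random inputs are illustrative assumptions, and the learned projection matrices (W^Q, W^K, W^V, W^O) of a real multi-head attention layer are omitted for brevity:

```python
# A minimal sketch, not the book's implementation: splitting d_model = 512
# into 8 heads of d_k = 64 and attending in each subspace in parallel.
import numpy as np

d_model = 512            # model dimension per token
heads = 8                # number of attention heads
d_k = d_model // heads   # 64 dimensions per head

seq_len = 4                           # hypothetical sequence length (assumption)
x = np.random.rand(seq_len, d_model)  # toy token representations

# Reshape so each head sees its own 64-dimensional subspace of every token.
x_heads = x.reshape(seq_len, heads, d_k).transpose(1, 0, 2)  # (8, 4, 64)

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention applied to all heads at once (here q = k = v)."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)               # (heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                 # softmax over keys
    return weights @ v                                             # (heads, seq, d_k)

# Each of the 8 heads attends in its own representation subspace.
head_outputs = scaled_dot_product_attention(x_heads, x_heads, x_heads)

# Concatenate the 8 heads back into the original d_model = 512 dimensions.
z = head_outputs.transpose(1, 0, 2).reshape(seq_len, d_model)
print(z.shape)  # (4, 512)
```

Because the 8 heads are independent along their leading axis, the same matrix operations can be dispatched to parallel hardware, which is the speed-up the quoted passage refers to.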
We can easily see the motivation for forcing the attention heads to learn 8 different perspectives. However, digging deeper into the motivations of the...