Exploring the Transformer’s architecture
The Transformer architecture was proposed as an alternative to RNNs for sequence-to-sequence tasks. It heavily relies on the self-attention mechanism to process both input and output sequences.
We’ll start by looking at the high-level architecture of the Transformer model (image based on that in the paper Attention Is All You Need, by Vaswani et al.):
Figure 13.1: Transformer architecture
As you can see, the Transformer consists of two parts: the encoder (the big rectangle on the left-hand side) and the decoder (the big rectangle on the right-hand side). The encoder encodes the input sequence. Each of its blocks has a multi-head attention layer and a regular feedforward layer. The decoder, on the other hand, generates the output sequence. Each of its blocks has a masked multi-head attention layer (we will talk about this in detail later), along with a multi-head attention layer and a regular feedforward layer.
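To make this encoder-decoder structure concrete, here is a minimal sketch, assuming a PyTorch environment and using the built-in torch.nn.Transformer module. The hyperparameters mirror the base configuration from the paper (512-dimensional embeddings, 8 heads, 6 encoder and 6 decoder blocks), but the tensor shapes and sequence lengths are arbitrary placeholders chosen for illustration:

```python
import torch
import torch.nn as nn

# A high-level Transformer with an encoder and a decoder stack,
# matching the two rectangles in Figure 13.1
model = nn.Transformer(
    d_model=512,           # embedding size
    nhead=8,               # number of attention heads
    num_encoder_layers=6,  # encoder blocks (left rectangle)
    num_decoder_layers=6,  # decoder blocks (right rectangle)
    dim_feedforward=2048,  # size of the feedforward sublayer
    batch_first=True,
)

# Placeholder input (source) and output (target) sequences,
# already projected to d_model-sized embeddings
src = torch.rand(1, 10, 512)
tgt = torch.rand(1, 20, 512)

# Causal mask for the decoder's masked multi-head attention:
# position t cannot attend to positions after t
tgt_mask = nn.Transformer.generate_square_subsequent_mask(20)

output = model(src, tgt, tgt_mask=tgt_mask)
print(output.shape)  # torch.Size([1, 20, 512])
```

The mask is what distinguishes the decoder's first attention sublayer from the encoder's: it prevents each output position from looking ahead at tokens that haven't been generated yet.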
At step t, the Transformer...