Advancing language understanding with the Transformer model
The Transformer model was first proposed in Attention Is All You Need (https://arxiv.org/abs/1706.03762). It can effectively handle long-term dependencies, which remain challenging for LSTMs. In this section, we will go through the Transformer's architecture and building blocks, as well as its most crucial component: the self-attention layer.
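Before we look at the architecture, it may help to see what the self-attention computation boils down to in code. The following is a minimal sketch (not the book's implementation) of scaled dot-product attention, the operation at the heart of the self-attention layer, written with PyTorch. In a full Transformer, the queries, keys, and values are learned linear projections of the same input; we skip those projections here for brevity, and the tensor names and shapes are purely illustrative:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    # Similarity between every query and every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5
    # Normalize the scores into attention weights over the sequence
    weights = F.softmax(scores, dim=-1)
    # Each output position is a weighted sum of the value vectors
    return torch.matmul(weights, V), weights

# Toy example: a batch of one sequence with 4 tokens, each an 8-dimensional vector
x = torch.randn(1, 4, 8)
# In self-attention, queries, keys, and values all come from the same input
output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape)   # torch.Size([1, 4, 8])
print(weights.shape)  # torch.Size([1, 4, 4])
```

Because every token attends directly to every other token, the distance between two positions no longer matters, which is why the Transformer handles long-term dependencies more easily than an LSTM.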
Exploring the Transformer's architecture
We'll start by looking at the high-level architecture of the Transformer model (image taken from Attention Is All You Need):
Figure 13.15: Transformer architecture
As you can see, the Transformer consists of two parts: the encoder (the big rectangle on the left-hand side) and the decoder (the big rectangle on the right-hand side). The encoder encodes the input sequence. It has a multi-head attention layer (we will talk about this next) and a regular feedforward layer. On the other hand, the decoder...