The architecture of BERT
BERT introduces bidirectional attention to transformer models. Replacing the original masked, left-to-right attention with bidirectional attention requires several other changes to the original Transformer model.
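To make the contrast concrete, the following minimal sketch (not taken from the paper; the sequence length and tensors are illustrative assumptions) compares the causal mask used by a Transformer decoder with the full, bidirectional attention pattern used by BERT's encoder:

```python
# Minimal sketch: causal (decoder-style) mask vs. BERT's bidirectional mask.
import torch

seq_len = 5  # e.g., the five tokens of a short input sentence

# Decoder-style causal mask: position i may only attend to positions <= i,
# so each token sees only its left context.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# BERT-style bidirectional mask: every position attends to every other
# position, so each token is conditioned on both its left and right context.
bidirectional_mask = torch.ones(seq_len, seq_len)

print(causal_mask)
print(bidirectional_mask)
```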
We will not go through the building blocks of transformers described in Chapter 2, Getting Started with the Architecture of the Transformer Model; you can consult that chapter at any time to review them. In this section, we will focus on the specific aspects of BERT models.
We will focus on the changes Devlin et al. (2018) made to the Transformer's encoder stack. We will first go through the encoder stack, then the preparation of the pretraining input environment. We will then describe the two-step framework of BERT: pretraining and fine-tuning.
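The following sketch (assuming the Hugging Face transformers library used later in the book; the model name and label count are illustrative) shows the two-step framework in miniature: the same pretrained encoder stack is loaded once with its pretraining head and once with a task-specific head ready for fine-tuning.

```python
# Sketch of BERT's two-step framework with Hugging Face transformers.
from transformers import BertForMaskedLM, BertForSequenceClassification

# Step 1: the model as it was pretrained, with a masked language modeling head.
pretrained = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Step 2: the same encoder stack, now topped with an untrained classification
# head to be fine-tuned on a downstream task (here, an assumed two-class task).
fine_tunable = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Both models share the same 12-layer BERT-base encoder stack underneath.
print(type(pretrained.bert), type(fine_tunable.bert))
```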
Let’s first explore the encoder stack.
The encoder stack
The first building block we will take from the original Transformer model is an encoder layer. The encoder layer, as described in Chapter...