The BERT researchers presented the model in two standard configurations:
- BERT-base
- BERT-large
Let's take a look at each of these in detail.
BERT-base
BERT-base consists of 12 encoder layers, each stacked one on top of the other. Every encoder layer uses 12 attention heads and a hidden size of 768 units. Thus, the size of the representation obtained from BERT-base will be 768.
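As a quick sanity check, these values can be inspected programmatically. The following is a minimal sketch, assuming the Hugging Face transformers library is installed and the pretrained bert-base-uncased checkpoint can be downloaded (neither is part of the text above):

```python
from transformers import BertModel

# Load the pretrained BERT-base model (downloads the checkpoint on first use)
model = BertModel.from_pretrained('bert-base-uncased')

print(model.config.num_hidden_layers)    # 12 encoder layers
print(model.config.num_attention_heads)  # 12 attention heads per encoder
print(model.config.hidden_size)          # 768, the representation size

# Total number of parameters (roughly 110 million)
print(sum(p.numel() for p in model.parameters()))
```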
We use the following notation:
- The number of encoder layers is denoted by L.
- The number of attention heads is denoted by A.
- The number of hidden units is denoted by H.
Thus, in the BERT-base model, we have L = 12, A = 12, and H = 768. The total number of parameters in BERT-base is 110 million. The BERT-base model is shown in the following diagram:
BERT-large
BERT-large consists of 24 encoder layers, each stacked one on top of the other. Every encoder layer uses 16 attention heads and a hidden size of 1,024 units. Thus, the size of the representation obtained from BERT-large will be 1,024.
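As with BERT-base, these values can be verified with a short sketch, again assuming the Hugging Face transformers library and the pretrained bert-large-uncased checkpoint are available:

```python
from transformers import BertModel

# Load the pretrained BERT-large model
model = BertModel.from_pretrained('bert-large-uncased')

print(model.config.num_hidden_layers)    # 24 encoder layers
print(model.config.num_attention_heads)  # 16 attention heads per encoder
print(model.config.hidden_size)          # 1024, the representation size

# Total number of parameters (roughly 340 million, as reported in the BERT paper)
print(sum(p.numel() for p in model.parameters()))
```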