The architecture of BERT
BERT introduces bidirectional attention to transformer models. Bidirectional attention requires many other changes to the original Transformer model.
We will not go through the building blocks of transformers described in Chapter 1, Getting Started with the Model Architecture of the Transformer. You can consult Chapter 1 at any time to review an aspect of the building blocks of transformers. In this section, we will focus on the specific aspects of BERT models.
We will focus on the evolutions designed by Devlin et al. (2018), which center on the encoder stack.
We will first go through the encoder stack, then the preparation of the pretraining input environment. Then we will describe the two-step framework of BERT: pretraining and fine-tuning.
Let's first explore the encoder stack.
The encoder stack
The first building block we will take from the original Transformer model is an encoder layer. The encoder layer as described in Chapter 1, Getting Started with the Model Architecture of the Transformer, is shown in Figure 2.1:
Figure 2.1: The encoder layer
The BERT model does not use decoder layers; it has an encoder stack but no decoder stack. The masked tokens (the tokens hidden for prediction) are handled in the attention layers of the encoder, as we will see when we zoom into a BERT encoder layer in the following sections.
The original Transformer contains a stack of N=6 layers. The number of dimensions of the original Transformer is dmodel = 512. The number of attention heads of the original Transformer is A=8. The dimension of a head of the original Transformer is:

dimension of a head = dmodel / A = 512 / 8 = 64
BERT encoder layers are larger than the original Transformer model.
Two BERT models can be built with the encoder layers:
- BERTBASE, which contains a stack of N=12 encoder layers. dmodel = 768 and can also be expressed as H=768, as in the BERT paper. A multi-head attention sub-layer contains A=12 heads. The dimension of each head zA remains 64, as in the original Transformer model:

dimension of a head = dmodel / A = 768 / 12 = 64
The output of each multi-head attention sub-layer before concatenation will be the output of the 12 heads:
output_multi-head_attention={z0, z1, z2,…,z11}
- BERTLARGE, which contains a stack of N=24 encoder layers. dmodel = 1024. A multi-head attention sub-layer contains A=16 heads. The dimension of each head zA also remains 64, as in the original Transformer model:

dimension of a head = dmodel / A = 1024 / 16 = 64
The output of each multi-head attention sub-layer before concatenation will be the output of the 16 heads:
output_multi-head_attention={z0, z1, z2,…,z15}
The sizes of the models can be summed up as follows:
Figure 2.2: Transformer models
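To make the relationships between these numbers explicit, here is a minimal Python sketch. The names and values simply restate the dimensions described above; they are illustrative and not part of any library API:

```python
# A minimal sketch of the size relationships summed up in Figure 2.2.
configs = {
    "Original Transformer": {"layers": 6,  "d_model": 512,  "heads": 8},
    "BERT-base":            {"layers": 12, "d_model": 768,  "heads": 12},
    "BERT-large":           {"layers": 24, "d_model": 1024, "heads": 16},
}

for name, cfg in configs.items():
    # Each attention head works on d_model / A dimensions (64 in all three cases).
    head_dim = cfg["d_model"] // cfg["heads"]
    print(f"{name}: N={cfg['layers']}, d_model={cfg['d_model']}, "
          f"A={cfg['heads']}, head dimension={head_dim}")
```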
Size and dimensions play an essential role in BERT-style pretraining. In this respect, BERT models are like humans: they produce better results with more working memory (dimensions) and more knowledge (data). Large transformer models that learn large amounts of data will pretrain better for downstream NLP tasks.
Let's now go to the first sub-layer and see the fundamental aspects of input embedding and positional encoding in a BERT model.
Preparing the pretraining input environment
The BERT model has no decoder stack of layers. As such, it does not have a masked multi-head attention sub-layer. The BERT authors go further and argue that a masked multi-head attention sub-layer, which hides the rest of the sequence, impedes the attention process.
A masked multi-head attention layer masks all of the tokens that are beyond the present position. For example, take the following sentence:
The cat sat on it because it was a nice rug.
If we have just reached the word "it," the input of the encoder could be:
The cat sat on it<masked sequence>
The motivation of this approach is to prevent the model from seeing the output it is supposed to predict. This left-to-right approach produces relatively good results.
However, the model cannot learn much this way. To know what "it" refers to, we need to see the whole sentence to reach the word "rug" and figure out that "it" was the rug.
The authors of BERT came up with an idea. Why not pretrain the model to make predictions using a different approach?
The authors of BERT came up with bidirectional attention, letting an attention head attend to all of the words both from left to right and right to left. In other words, the self-attention of an encoder could do the job without being hindered by the masked multi-head attention sub-layer of the decoder.
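To make the difference concrete, here is a small illustrative sketch (assuming PyTorch is available, as it is throughout this book) contrasting the decoder-style causal mask with the all-ones mask that an encoder's bidirectional self-attention effectively uses:

```python
import torch

seq_len = 5  # for example, the first five tokens of "The cat sat on it ..."

# Decoder-style (left-to-right) mask: position i can only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Encoder-style (bidirectional) mask: every position attends to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len)

print(causal_mask)
print(bidirectional_mask)
```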
The model was pretrained with two tasks. The first is Masked Language Modeling (MLM). The second is Next Sentence Prediction (NSP).
Let's start with masked language modeling.
Masked language modeling
Masked language modeling does not require training a model with a sequence of visible words followed by a masked sequence to predict.
BERT introduces the bidirectional analysis of a sentence with a random mask on a word of the sentence.
It is important to note that BERT applies WordPiece tokenization, a sub-word segmentation method, to the inputs. It also uses learned positional encoding, not the sine-cosine approach of the original Transformer.
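As an illustration (assuming the Hugging Face transformers library is installed, which we will rely on later in this book), a pretrained WordPiece tokenizer can be inspected directly:

```python
from transformers import BertTokenizer

# Load the WordPiece tokenizer that ships with the pretrained BERT-base model.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Sub-word segmentation: words missing from the vocabulary are split into
# pieces prefixed with "##".
print(tokenizer.tokenize("The cat slept on the rug. It likes sleeping all day."))
```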
A potential input sequence could be:
"The cat sat on it because it was a nice rug."
The decoder would mask the attention sequence after the model reached the word "it":

"The cat sat on it <masked sequence>."
But the BERT encoder masks a random token to make a prediction:
"The cat sat on it [MASK] it was a nice rug."
The multi-attention sub-layer can now see the whole sequence, run the self-attention process, and predict the masked token.
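To see what the pretrained MLM head does with such an input (again assuming the Hugging Face transformers library is installed), a fill-mask pipeline can be asked to predict the hidden token:

```python
from transformers import pipeline

# A pretrained BERT model with its masked language modeling head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The pipeline returns the most likely tokens for the [MASK] position.
for prediction in fill_mask("The cat sat on it [MASK] it was a nice rug."):
    print(prediction["token_str"], prediction["score"])
```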
The input tokens were masked in a tricky way to force the model to train longer but produce better results. Three methods were used:
- Surprise the model by not masking a single token on 10% of the dataset; for example:
"The cat sat on it [because] it was a nice rug."
- Surprise the model by replacing the token with a random token on 10% of the dataset; for example:
"The cat sat on it [often] it was a nice rug."
- Replace a token with a [MASK] token on 80% of the dataset; for example:
"The cat sat on it [MASK] it was a nice rug."
The authors' bold approach avoids overfitting and forces the model to train efficiently.
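A minimal sketch of this 80/10/10 selection rule (illustrative only, not the authors' original code) could look as follows:

```python
import random

def apply_mlm_masking(token, vocabulary=("rug", "cat", "often", "nice")):
    """Apply the 80/10/10 rule to a token already selected for prediction.

    Illustrative sketch: the vocabulary is a toy list, not BERT's WordPiece
    vocabulary, and real implementations work on token IDs rather than strings.
    """
    p = random.random()
    if p < 0.8:                       # 80% of the time: hide the token
        return "[MASK]"
    elif p < 0.9:                     # 10% of the time: replace with a random token
        return random.choice(vocabulary)
    else:                             # 10% of the time: keep the token unchanged
        return token

# Example: the token "because" was selected for prediction.
print(apply_mlm_masking("because"))
```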
BERT was also trained to perform next sentence prediction.
Next sentence prediction
The second method used to train BERT is Next Sentence Prediction (NSP). The input contains two sentences.
Two new tokens were added:
- [CLS] is a binary classification token added to the beginning of the first sequence to predict if the second sequence follows the first sequence. A positive sample is usually a pair of consecutive sentences taken from a dataset. A negative sample is created using sequences from different documents.
- [SEP] is a separation token that signals the end of a sequence.
For example, the input sentences taken out of a book could be:
"The cat slept on the rug. It likes sleeping all day."
These two sentences would become one input complete sequence:
[CLS] the cat slept on the rug [SEP] it likes sleep ##ing all day [SEP]
This approach requires additional encoding information to distinguish sequence A from sequence B.
If we put the whole embedding process together, we obtain:
Figure 2.3: Input embeddings
The input embeddings are obtained by summing the token embeddings, the segment (sentence, phrase, word) embeddings, and the positional encoding embeddings.
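A minimal PyTorch sketch (with illustrative BERTBASE-like dimensions and toy token IDs, not the actual BERT source code) shows how the three embeddings are summed:

```python
import torch
import torch.nn as nn

# Illustrative values close to BERT-base; not the model's real weights.
vocab_size, max_len, d_model = 30522, 512, 768

token_embeddings    = nn.Embedding(vocab_size, d_model)
segment_embeddings  = nn.Embedding(2, d_model)        # sentence A = 0, sentence B = 1
position_embeddings = nn.Embedding(max_len, d_model)  # learned, not sine-cosine

input_ids   = torch.tensor([[101, 1996, 4937, 102, 2009, 102]])  # toy token IDs
segment_ids = torch.tensor([[0,   0,    0,    0,   1,    1]])    # sequence A vs. B
positions   = torch.arange(input_ids.size(1)).unsqueeze(0)       # 0, 1, 2, ...

# The input representation fed to the first encoder layer is the sum of the three.
input_embeddings = (token_embeddings(input_ids)
                    + segment_embeddings(segment_ids)
                    + position_embeddings(positions))
print(input_embeddings.shape)  # torch.Size([1, 6, 768])
```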
The input embedding and positional encoding sub-layer of a BERT model can be summed up as follows:
- A sequence of words is broken down into WordPiece tokens.
- A [MASK] token will randomly replace the initial word tokens for masked language modeling training.
- A [CLS] classification token is inserted at the beginning of a sequence for classification purposes.
- A [SEP] token separates two sentences (segments, phrases) for NSP training.
- Sentence embedding is added to token embedding, so that sentence A has a different sentence embedding value than sentence B.
- Positional encoding is learned. The sine-cosine positional encoding method of the original Transformer is not applied.
Some additional key features are:
- BERT uses bidirectional attention in all of its multi-head attention sub-layers, opening vast horizons of learning and understanding relationships between tokens.
- BERT introduces scenarios of unsupervised embedding, pretraining models with unlabeled text. This forces the model to think harder during the multi-head attention learning process. This makes BERT able to learn how languages are built and apply this knowledge to downstream tasks without having to pretrain each time.
- BERT also uses supervised learning, covering all bases in the pretraining process.
BERT has improved the training environment of transformers. Let's now see the motivation of pretraining and how it helps the fine-tuning process.
Pretraining and fine-tuning a BERT model
BERT is a two-step framework. The first step is the pretraining, and the second is fine-tuning, as shown in Figure 2.4:
Figure 2.4: The BERT framework
Training a transformer model can take hours, if not days. It takes quite some time to engineer the architecture and parameters, and select the proper datasets to train a transformer model.
Pretraining is the first step of the BERT framework that can be broken down into two sub-steps:
- Defining the model's architecture: number of layers, number of heads, dimensions, and the other building blocks of the model
- Training the model on Masked Language Modeling (MLM) and NSP tasks
The second step of the BERT framework is fine-tuning, which can also be broken down into two sub-steps:
- Initializing the downstream model chosen with the trained parameters of the pretrained BERT model
- Fine-tuning the parameters for specific downstream tasks such as Recognizing Textual Entailment (RTE), Question Answering (SQuAD v1.1, SQuAD v2.0), and Situations With Adversarial Generations (SWAG)
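As a preview of the fine-tuning step, here is a minimal sketch using the Hugging Face transformers library (not the exact code we will run later in this chapter). Initializing a downstream classifier with the pretrained parameters takes a single call:

```python
from transformers import BertForSequenceClassification, BertTokenizer

# Sub-step 1 of fine-tuning: initialize the downstream model with the pretrained
# BERT parameters. The classification head on top is newly initialized.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Sub-step 2: fine-tune the whole model on labeled downstream data, for example
# single sentences for CoLA or sentence pairs for RTE. A single forward pass:
inputs = tokenizer("The cat sat on the rug.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]): one score per class
```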
In this section, we covered the information we need to fine-tune a BERT model. In the following chapters, we will explore the topics we brought up in this section in more depth:
- In Chapter 3, Pretraining a RoBERTa Model from Scratch, we will pretrain a BERT-like model from scratch in 15 steps. We will even compile our own data, train a tokenizer, and then train the model. The goal of this chapter is to first go through the specific building blocks of BERT and then fine-tune an existing model.
- In Chapter 4, Downstream NLP Tasks with Transformers, we will go through many downstream NLP tasks, exploring
GLUE
,SQuAD v1.1
,SQuAD
,SWAG
,BLEU
, and several other NLP evaluation datasets. We will run several downstream transformer models to illustrate key tasks. The goal of this chapter is to fine-tune a downstream model.
- In Chapter 6, Text Generation with OpenAI GPT-2 and GPT-3 Models, we will explore the architecture and usage of OpenAI GPT, GPT-2, and GPT-3 transformers. BERTBASE was configured to be close to OpenAI GPT to show that it produced better performance. However, OpenAI transformers keep evolving too! We will see how.
In this chapter, the BERT model we will fine-tune will be trained on The Corpus of Linguistic Acceptability (CoLA). The downstream task is based on Neural Network Acceptability Judgments by Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman.
We will fine-tune a BERT model that will determine the grammatical acceptability of a sentence. The fine-tuned model will have acquired a certain level of linguistic competence.
We have gone through BERT architecture and its pretraining and fine-tuning framework. Let's now fine-tune a BERT model.