Transformers for Natural Language Processing

Fine-Tuning BERT Models

In Chapter 1, Getting Started with the Model Architecture of the Transformer, we defined the building blocks of the architecture of the original Transformer. Think of the original Transformer as a model built with LEGO^® bricks. The construction set contains bricks such as encoders, decoders, embedding layers, positional encoding methods, multi-head attention layers, masked multi-head attention layers, post-layer normalization, feed-forward sub-layers, and linear output layers. The bricks come in various sizes and forms. You can spend hours building all sorts of models using the same building kit! Some constructions will only require some of the bricks. Other constructions will add a new piece, just like when we obtain additional bricks for a model built using LEGO^® components.

BERT added a new piece to the Transformer building kit: a bidirectional multi-head attention sub-layer. When we humans are having problems understanding a sentence, we do not just look at the past words. BERT, like us, looks at all the words in the same sentence at the same time.

In this chapter, we will first explore the architecture of Bidirectional Encoder Representations from Transformers (BERT). BERT only uses the blocks of the encoders of the Transformer in a novel way and does not use the decoder stack.

Then we will fine-tune a pretrained BERT model. The BERT model we will fine-tune was trained by a third party and uploaded to Hugging Face. Transformers can be pretrained. Then, a pretrained BERT, for example, can be fine-tuned on several NLP tasks. We will go through this fascinating experience of downstream Transformer usage using Hugging Face modules.

This chapter covers the following topics:

Bidirectional Encoder Representations from Transformers (BERT)
The architecture of BERT
The two-step BERT framework
Preparing the pretraining environment
Defining pretraining encoder layers
Defining fine-tuning
Downstream multitasking
Building a fine-tuning BERT model
Loading an acceptability judgment dataset
Creating attention masks
BERT model configuration
Measuring the performance of the fine-tuned model

Our first step will be to explore the background of the Transformer.

The architecture of BERT

BERT introduces bidirectional attention to transformer models. Bidirectional attention requires many other changes to the original Transformer model.

We will not go through the building blocks of transformers described in Chapter 1, Getting Started with the Model Architecture of the Transformer. You can consult Chapter 1 at any time to review an aspect of the building blocks of transformers. In this section, we will focus on the specific aspects of BERT models.

We will focus on the evolutions designed by Devlin et al. (2018), which describe the encoder stack.

We will first go through the encoder stack, then the preparation of the pretraining input environment. Then we will describe the two-step framework of BERT: pretraining and fine-tuning.

Let's first explore the encoder stack.

The encoder stack

The first building block we will take from the original Transformer model is an encoder layer. The encoder layer as described in Chapter 1, Getting Started with the Model Architecture of the Transformer, is shown in Figure 2.1:

Figure 2.1: The encoder layer

The BERT model does not use decoder layers. A BERT model has an encoder stack but no decoder stacks. The masked tokens (hiding the tokens to predict) are in the attention layers of the encoder, as we will see when we zoom into a BERT encoder layer in the following sections.

The original Transformer contains a stack of N=6 layers. The number of dimensions of the original Transformer is d_model = 512. The number of attention heads of the original Transformer is A=8. The dimension of each head of the original Transformer is:

d_k = d_model / A = 512 / 8 = 64

BERT encoder layers are larger than those of the original Transformer model.

Two BERT models can be built with the encoder layers:

BERT_BASE, which contains a stack of N=12 encoder layers. d_model = 768 and can also be expressed as H=768, as in the BERT paper. A multi-head attention sub-layer contains A=12 heads. The dimension of each head z_A remains 64, as in the original Transformer model:

d_model / A = 768 / 12 = 64

The output of each multi-head attention sub-layer before concatenation will be the output of the 12 heads:

output_multi-head_attention={z₀, z₁, z₂,…,z₁₁}

BERT_LARGE, which contains a stack of N=24 encoder layers. d_model = 1024. A multi-head attention sub-layer contains A=16 heads. The dimension of each head z_A also remains 64, as in the original Transformer model:

d_model / A = 1024 / 16 = 64

The output of each multi-head attention sub-layer before concatenation will be the output of the 16 heads:

output_multi-head_attention={z₀, z₁, z₂,…,z₁₅}

The sizes of the models can be summed up as follows:

Figure 2.2: Transformer models
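
The head-size arithmetic for the three configurations can be checked with a few lines of Python. This is only a sketch of the division described in this section; the dictionary layout and variable names are illustrative choices, not part of any library.

```python
# Head size = model dimension / number of attention heads.
# The numbers are the ones quoted in this section.
models = {
    "Transformer (original)": {"layers": 6, "d_model": 512, "heads": 8},
    "BERT_BASE": {"layers": 12, "d_model": 768, "heads": 12},
    "BERT_LARGE": {"layers": 24, "d_model": 1024, "heads": 16},
}

head_dims = {}
for name, cfg in models.items():
    # d_model must divide evenly across the heads
    assert cfg["d_model"] % cfg["heads"] == 0
    head_dims[name] = cfg["d_model"] // cfg["heads"]
    print(f"{name}: N={cfg['layers']}, d_model={cfg['d_model']}, "
          f"A={cfg['heads']}, head size={head_dims[name]}")
```

All three models end up with heads of size 64: the larger BERT models add heads and layers rather than widening each head.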

Size and dimensions play an essential role in BERT-style pretraining. BERT models are like humans. BERT models produce better results with more working memory (dimensions), and more knowledge (data). Large transformer models that learn large amounts of data will pretrain better for downstream NLP tasks.

Let's now go to the first sub-layer and see the fundamental aspects of input embedding and positional encoding in a BERT model.

Preparing the pretraining input environment

The BERT model has no decoder stack of layers. As such, it does not have a masked multi-head attention sub-layer. BERT goes further and states that a masked multi-head attention layer that masks the rest of the sequence impedes the attention process.

A masked multi-head attention layer masks all of the tokens that are beyond the present position. For example, take the following sentence:

The cat sat on it because it was a nice rug.

If we have just reached the word "it," the input visible to the masked multi-head attention sub-layer of a decoder could be:

The cat sat on it<masked sequence>

The motivation of this approach is to prevent the model from seeing the output it is supposed to predict. This left-to-right approach produces relatively good results.

However, the model cannot learn much this way. To know what "it" refers to, we need to see the whole sentence to reach the word "rug" and figure out that "it" was the rug.

The authors of BERT came up with an idea. Why not pretrain the model to make predictions using a different approach?

The authors of BERT came up with bidirectional attention, letting an attention head attend to all of the words both from left to right and right to left. In other words, the self-attention mask of an encoder could do the job without being hindered by the masked multi-head attention sub-layer of the decoder.
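
The difference between the two approaches can be pictured as two attention masks. The following is a minimal sketch in plain Python, assuming whitespace tokenization of the example sentence; the variable names are illustrative, and real implementations build these masks as tensors.

```python
tokens = "the cat sat on it because it was a nice rug".split()
n = len(tokens)

# mask[i][j] == 1 means that position i may attend to position j.
# Left-to-right (decoder-style) masking: only positions up to i are visible.
causal_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
# Bidirectional (BERT-style) attention: every position is visible.
bidirectional_mask = [[1] * n for _ in range(n)]

i = tokens.index("it")  # position of the first "it"
visible_causal = [t for t, m in zip(tokens, causal_mask[i]) if m]
visible_bidirectional = [t for t, m in zip(tokens, bidirectional_mask[i]) if m]

print("left-to-right:", visible_causal)
print("bidirectional:", visible_bidirectional)
```

With the left-to-right mask, the word "rug" is hidden when the model processes "it"; with the bidirectional mask, it is visible.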

The model was pretrained on two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

Let's start with masked language modeling.

Masked language modeling

Masked language modeling does not require training a model with a sequence of visible words followed by a masked sequence to predict.

BERT introduces the bidirectional analysis of a sentence with a random mask on a word of the sentence.

It is important to note that BERT tokenizes its inputs with WordPiece, a sub-word segmentation method. It also uses learned positional encoding, not the sine-cosine approach.

A potential input sequence could be:

"The cat sat on it because it was a nice rug."

The decoder would mask the attention sequence after the model reached the word "it":

"The cat sat on it <masked sequence>."

But the BERT encoder masks a random token to make a prediction:

"The cat sat on it [MASK] it was a nice rug."

The multi-head attention sub-layer can now see the whole sequence, run the self-attention process, and predict the masked token.

The input tokens selected for prediction were handled in three different ways, which forces the model to train longer but produces better results:

Surprise the model by leaving the selected token unchanged 10% of the time; for example:
```
"The cat sat on it [because] it was a nice rug."
```

Surprise the model by replacing the selected token with a random token 10% of the time; for example:
```
"The cat sat on it [often] it was a nice rug."
```

Replace the selected token with a [MASK] token 80% of the time; for example:
```
"The cat sat on it [MASK] it was a nice rug."
```

The authors' bold approach avoids overfitting and forces the model to train efficiently.
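
The three cases above can be sketched in plain Python. This is an illustration under stated assumptions, not the pretraining code used for BERT: the function name `mask_tokens`, the toy vocabulary, and the fixed seed are hypothetical, and the 15% selection rate is the value reported by Devlin et al. (2018).

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            # Token not selected: keep it and do not predict it
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: leave the token unchanged
    return corrupted, labels

tokens = "the cat sat on it because it was a nice rug .".split()
vocab = sorted(set(tokens)) + ["often", "dog"]  # toy vocabulary
rng = random.Random(0)                          # fixed seed for repeatability
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted)
print(labels)
```

Because the model cannot tell which of the visible tokens were replaced or kept, it has to build a useful representation for every position, not only for the [MASK] positions.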

BERT was also trained to perform next sentence prediction.

Next sentence prediction

The second task used to pretrain BERT is Next Sentence Prediction (NSP). The input contains two sentences.

Two new tokens were added:

[CLS] is a binary classification token added to the beginning of the first sequence to predict if the second sequence follows the first sequence. A positive sample is usually a pair of consecutive sentences taken from a dataset. A negative sample is created using sequences from different documents.
[SEP] is a separation token that signals the end of a sequence.

For example, the input sentences taken out of a book could be:

"The cat slept on the rug. It likes sleeping all day."

These two sentences would become one input complete sequence:

[CLS] the cat slept on the rug [SEP] it likes sleep ##ing all day [SEP]

This approach requires additional encoding information to distinguish sequence A from sequence B.
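
That additional information is a segment ID per token. The sketch below builds the sequence from the example and assigns 0 to every token of sequence A (including [CLS] and the first [SEP]) and 1 to every token of sequence B (including the final [SEP]). The helper name `build_pair` is hypothetical.

```python
def build_pair(tokens_a, tokens_b):
    """Join two tokenized sentences and mark which segment each token belongs to."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS] + A + [SEP] belong to segment 0; B + [SEP] belong to segment 1
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens_a = ["the", "cat", "slept", "on", "the", "rug"]
tokens_b = ["it", "likes", "sleep", "##ing", "all", "day"]
tokens, segment_ids = build_pair(tokens_a, tokens_b)

for tok, seg in zip(tokens, segment_ids):
    print(f"{tok:8s} {seg}")
```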

If we put the whole embedding process together, we obtain:

Figure 2.3: Input embeddings

The input embeddings are obtained by summing the token embeddings, the segment (sentence, phrase, word) embeddings, and the positional encoding embeddings.

The input embedding and positional encoding sub-layer of a BERT model can be summed up as follows:

A sequence of words is broken down into WordPiece tokens.
A [MASK] token will randomly replace the initial word tokens for masked language modeling training.
A [CLS] classification token is inserted at the beginning of a sequence for classification purposes.
A [SEP] token separates two sentences (segments, phrases) for NSP training.
Sentence embedding is added to token embedding, so that sentence A has a different sentence embedding value than sentence B.
Positional encoding is learned. The sine-cosine positional encoding method of the original Transformer is not applied.
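
The summation described above can be sketched with small lookup tables. This is a toy illustration in plain Python: the hidden size of 4, the tiny vocabulary, and the random initial values are assumptions made for readability. In a real BERT model, the three tables are learned, and the sum is followed by layer normalization and dropout.

```python
import random

rng = random.Random(0)
hidden = 4  # toy size; BERT_BASE uses 768

def table(rows):
    """Create a lookup table of `rows` vectors of size `hidden`."""
    return [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(rows)]

vocab = {"[CLS]": 0, "[SEP]": 1, "the": 2, "cat": 3, "slept": 4, "it": 5, "likes": 6}
token_emb = table(len(vocab))  # one vector per token in the vocabulary
segment_emb = table(2)         # one vector for sentence A, one for sentence B
position_emb = table(16)       # one vector per position (learned, not sine-cosine)

tokens = ["[CLS]", "the", "cat", "slept", "[SEP]", "it", "likes", "[SEP]"]
segments = [0, 0, 0, 0, 0, 1, 1, 1]

inputs = []
for pos, (tok, seg) in enumerate(zip(tokens, segments)):
    t, s, p = token_emb[vocab[tok]], segment_emb[seg], position_emb[pos]
    # Input embedding = token embedding + segment embedding + position embedding
    inputs.append([a + b + c for a, b, c in zip(t, s, p)])

print(len(inputs), "vectors of size", len(inputs[0]))
```

The two [SEP] tokens share the same token embedding but receive different input vectors, because their segment and position embeddings differ.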

Some additional key features are:

BERT uses bidirectional attention in all of its multi-head attention sub-layers, opening vast horizons of learning and understanding relationships between tokens.
BERT introduces scenarios of unsupervised embedding, pretraining models with unlabeled text. This forces the model to think harder during the multi-head attention learning process. This makes BERT able to learn how languages are built and apply this knowledge to downstream tasks without having to pretrain each time.
BERT also uses supervised learning, covering all bases in the pretraining process.

BERT has improved the training environment of transformers. Let's now see the motivation of pretraining and how it helps the fine-tuning process.

Pretraining and fine-tuning a BERT model

BERT is a two-step framework. The first step is the pretraining, and the second is fine-tuning, as shown in Figure 2.4:

Figure 2.4: The BERT framework

Training a transformer model can take hours, if not days. It takes quite some time to engineer the architecture and parameters, and select the proper datasets to train a transformer model.

Pretraining is the first step of the BERT framework that can be broken down into two sub-steps:

Defining the model's architecture: number of layers, number of heads, dimensions, and the other building blocks of the model
Training the model on Masked Language Modeling (MLM) and NSP tasks

The second step of the BERT framework is fine-tuning, which can also be broken down into two sub-steps:

Initializing the downstream model chosen with the trained parameters of the pretrained BERT model
Fine-tuning the parameters for specific downstream tasks such as Recognizing Textual Entailment (RTE), Question Answering (SQuAD v1.1, SQuAD v2.0), and Situations With Adversarial Generations (SWAG)

In this section, we covered the information we need to fine-tune a BERT model. In the following chapters, we will explore the topics we brought up in this section in more depth:

In Chapter 3, Pretraining a RoBERTa Model from Scratch, we will pretrain a BERT-like model from scratch in 15 steps. We will even compile our own data, train a tokenizer, and then train the model. The goal of this chapter is to first go through the specific building blocks of BERT and then fine-tune an existing model.
In Chapter 4, Downstream NLP Tasks with Transformers, we will go through many downstream NLP tasks, exploring GLUE, SQuAD v1.1, SQuAD, SWAG, BLEU, and several other NLP evaluation datasets. We will run several downstream transformer models to illustrate key tasks. The goal of this chapter is to fine-tune a downstream model.
In Chapter 6, Text Generation with OpenAI GPT-2 and GPT-3 Models, we will explore the architecture and usage of OpenAI GPT, GPT-2, and GPT-3 transformers. BERT_BASE was configured to be close to OpenAI GPT to show that it produced better performance. However, OpenAI transformers keep evolving too! We will see how.

In this chapter, the BERT model we will fine-tune will be trained on The Corpus of Linguistic Acceptability (CoLA). The downstream task is based on Neural Network Acceptability Judgments by Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman.

We will fine-tune a BERT model that will determine the grammatical acceptability of a sentence. The fine-tuned model will have acquired a certain level of linguistic competence.

We have gone through BERT architecture and its pretraining and fine-tuning framework. Let's now fine-tune a BERT model.

Fine-tuning BERT

In this section, we will fine-tune a BERT model to predict the downstream task of Acceptability Judgements and measure the predictions with the Matthews Correlation Coefficient (MCC), which will be explained in the Evaluating using Matthews Correlation Coefficient section of this chapter.
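
As a preview of the metric, the following sketch computes MCC for binary labels from the four cells of the confusion matrix. The function name `mcc` and the sample labels are made up for illustration; in practice, a library implementation such as scikit-learn's `matthews_corrcoef` would normally be used.

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when the denominator is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0

y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # 1 = acceptable, 0 = unacceptable
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
print(round(mcc(y_true, y_pred), 4))
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), which makes it more informative than accuracy when the two classes are unbalanced.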

Open BERT_Fine_Tuning_Sentence_Classification_DR.ipynb in Google Colab (make sure you have an email account). The notebook is in Chapter02 of the GitHub repository of this book.

The title of each cell in the notebook is the same as, or very close to, the title of each subsection of this chapter.

Let's start by making sure the GPU is activated.

Activating the GPU

Pretraining a multi-head attention transformer model requires the parallel processing GPUs can provide.

The program starts by checking whether the GPU is activated:

#@title Activating the GPU
# Main menu->Runtime->Change Runtime Type
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

The output should be:

Found GPU at: /device:GPU:0

The program will be using Hugging Face modules.

Installing the Hugging Face PyTorch interface for BERT

Hugging Face provides a pretrained BERT model. Hugging Face developed a base class named PreTrainedModel. By installing this class, we can load a model from a pretrained model configuration.

Hugging Face provides modules in TensorFlow and PyTorch. I recommend that developers feel comfortable with both environments. Excellent AI research teams use either or both environments.

In this chapter, we will install the modules required as follows:

#@title Installing the Hugging Face PyTorch Interface for Bert
!pip install -q transformers

The installation will run, or requirement satisfied messages will be displayed.

We can now import the modules needed for the program.

Importing the modules

We will import the pretrained modules required, such as the pretrained BERT tokenizer and the configuration of the BERT model. The AdamW optimizer is imported along with the sequence classification module and a linear warm-up scheduler:

#@title Importing the modules
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig
from transformers import AdamW, BertForSequenceClassification, get_linear_schedule_with_warmup

A nice progress bar module is imported from tqdm:

from tqdm import tqdm, trange

We can now import the widely used standard Python modules:

import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt

No message will be displayed if all goes well, bearing in mind that Google Colab has pre-installed the modules on the VM we are using.

Specifying CUDA as the device for torch

We will now specify that torch uses the Compute Unified Device Architecture (CUDA) to put the parallel computing power of the NVIDIA card to work for our multi-head attention model:

#@title Specify CUDA as device for Torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

The VM I ran on Google Colab displayed the following output:

'Tesla P100-PCIE-16GB'

The output may vary with Google Colab configurations.

We will now load the dataset.

Loading the dataset

We will now load the CoLA dataset, which is based on the Warstadt et al. (2018) paper.

General Language Understanding Evaluation (GLUE) considers Linguistic Acceptability as a top-priority NLP task. In Chapter 4, Downstream NLP Tasks with Transformers, we will explore the key tasks transformers must perform to prove their efficiency.

Use the Google Colab file manager to upload in_domain_train.tsv and out_of_domain_dev.tsv, which you will find on GitHub in the Chapter02 directory of the repository of the book.

You should see them appear in the file manager:

Figure 2.5: Uploading the datasets

Now the program will load the datasets:

#@title Loading the Dataset
#source of dataset : https://nyu-mll.github.io/CoLA/
df = pd.read_csv("in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
df.shape

The output displays the shape of the dataset we have imported:

(8551, 4)

A 10-line sample is displayed to visualize the Acceptability Judgment task and see if a sequence makes sense or not:

df.sample(10)

The output shows 10 lines of the labeled dataset:

      sentence_source  label  label_notes  sentence
1742  r-67             1      NaN          they said that tom would n't pay up , but pay ...
937   bc01             1      NaN          although he likes cabbage too , fred likes egg...
5655  c_13             1      NaN          wendy 's mother country is iceland .
500   bc01             0      *            john is wanted to win .
4596  ks08             1      NaN          i did n't find any bugs in my bed .
7412  sks13            1      NaN          the girl he met at the departmental party will...
8456  ad03             0      *            peter is the old pigs .
744   bc01             0      *            frank promised the men all to leave .
5420  b_73             0      *            i 've seen as much of a coward as frank .
5749  c_13             1      NaN          we drove all the way to buenos aires .

Each sample in the .tsv files contains four tab-separated columns:

Column 1: the source of the sentence (code)
Column 2: the label (0=unacceptable, 1=acceptable)
Column 3: the label annotated by the author
Column 4: the sentence to be classified

You can open the .tsv files locally to read a few samples of the dataset. The program will now process the data for the BERT model.

Creating sentences, label lists, and adding BERT tokens

The program will now create the sentences as described in the Preparing the pretraining input environment section of this chapter:

#@title Creating sentence, label lists and adding Bert tokens
sentences = df.sentence.values
# Adding CLS and SEP tokens at the beginning and end of each sentence for BERT
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = df.label.values

The [CLS] and [SEP] have now been added.

The program now activates the tokenizer.

Activating the BERT tokenizer

In this section, we will initialize a pretrained BERT tokenizer. This will save the time it would take to train it from scratch.

The program selects an uncased tokenizer, activates it, and displays the first tokenized sentence:

#@title Activating the BERT Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print ("Tokenize the first sentence:")
print (tokenized_texts[0])

The output contains the classification token [CLS] and the separation token [SEP]:

Tokenize the first sentence:
['[CLS]', 'our', 'friends', 'wo', 'n', "'", 't', 'buy', 'this', 'analysis', ',', 'let', 'alone', 'the', 'next', 'one', 'we', 'propose', '.', '[SEP]']
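
To give an intuition of how a WordPiece-style tokenizer splits a word into sub-words, here is a simplified greedy longest-match-first sketch in plain Python. It is not the Hugging Face implementation: the function name `wordpiece` and the five-entry vocabulary are hypothetical, and the real tokenizer also handles lowercasing, punctuation splitting, and a vocabulary of about 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest sub-words found in vocab, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no sub-word matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"sleep", "##ing", "rug", "##s", "cat"}
print(wordpiece("sleeping", toy_vocab))  # ['sleep', '##ing']
print(wordpiece("rugs", toy_vocab))      # ['rug', '##s']
print(wordpiece("dog", toy_vocab))       # ['[UNK]']
```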

The program will now process the data.

Processing the data

We need to determine a fixed maximum length and process the data for the model. The sentences in the datasets are short. To leave some margin, the program sets the maximum length of a sequence to 128, and the sequences are padded:

#@title Processing the data
# Set the maximum sequence length. The longest sequence in our training set is 47, but we'll leave room on the end anyway. 
# In the original paper, the authors used a length of 512.
MAX_LEN = 128
# Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
# Pad our input tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
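
The effect of `truncating="post"` and `padding="post"` can be reproduced on a toy example in plain Python. The helper `pad_post` is a hypothetical stand-in for illustration, not the Keras function. In the bert-base-uncased vocabulary, 101 is [CLS], 102 is [SEP], and 0 is [PAD]; the other IDs below are arbitrary.

```python
def pad_post(sequences, maxlen, pad_id=0):
    """Cut each sequence at maxlen and fill shorter ones with pad_id at the end."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                             # truncating="post"
        padded.append(seq + [pad_id] * (maxlen - len(seq)))  # padding="post"
    return padded

ids = [
    [101, 2256, 2814, 102],
    [101, 1045, 102],
    [101, 7, 8, 9, 10, 11, 102],  # longer than maxlen
]
padded = pad_post(ids, maxlen=5)
for row in padded:
    print(row)
```

Note that post-truncation drops the final [SEP] token (102) of a sequence that is too long. With MAX_LEN = 128 and a longest training sequence of 47 tokens, this does not happen here.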

The sequences have been processed and now the program creates the attention masks.

Creating attention masks

Now comes a tricky part of the process. We padded the sequences in the previous cell. But we want to prevent the model from performing attention on those padded tokens!

The idea is to apply a mask with a value of 1 for each token, which will be followed by 0s for padding:

#@title Create attention masks
attention_masks = []
# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)
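
On a toy batch, the same logic gives the following result. The sketch assumes, as the cell above does, that the padding ID is 0 and that no real token has ID 0; the sample IDs are arbitrary.

```python
PAD_ID = 0
padded = [
    [101, 2256, 2814, 102, 0, 0],
    [101, 1045, 102, 0, 0, 0],
]

# 1.0 for real tokens, 0.0 for padding
attention_masks = [[float(tok != PAD_ID) for tok in seq] for seq in padded]
# The sum of a mask gives the number of real tokens in the sequence
real_lengths = [int(sum(mask)) for mask in attention_masks]

for mask in attention_masks:
    print(mask)
print(real_lengths)
```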

The program will now split the data.

Splitting data into training and validation sets

The program now performs the standard process of splitting the data into training and validation sets:

#@title Splitting data into train and validation sets
# Use train_test_split to split our data into train and validation sets for training
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels, random_state=2018, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids,random_state=2018, test_size=0.1)
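
The two calls use the same `random_state` so that the masks are split in exactly the same order as the inputs. The sketch below illustrates why this matters with a hypothetical `split` helper written in plain Python; it is a simplified stand-in and does not reproduce the shuffling algorithm of scikit-learn.

```python
import random

def split(items, test_size, seed):
    """Shuffle indices with a seeded generator and cut off a validation slice."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

inputs = [[i, i] for i in range(10)]      # stand-ins for padded input IDs
masks = [f"mask-{i}" for i in range(10)]  # stand-ins for attention masks

train_x, val_x = split(inputs, 0.1, seed=2018)
train_m, val_m = split(masks, 0.1, seed=2018)

# Same seed, same length: each input is still paired with its own mask
aligned = all(m == f"mask-{x[0]}" for x, m in zip(train_x, train_m))
print(aligned, len(train_x), len(val_x))
```

With two different seeds, the inputs and masks would be shuffled differently, and each sequence would be trained with the attention mask of another sequence.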

The data is ready to be trained, but it still needs to be adapted to torch.

Converting all the data into torch tensors

The fine-tuning model uses torch tensors. The program must convert the data into torch tensors:

#@title Converting all the data into torch tensors
# Torch tensors are the required datatype for our model
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

The conversion is over. Now we need to create an iterator.

Selecting a batch size and creating an iterator

In this cell, the program selects a batch size and creates an iterator. The iterator is a clever way of avoiding a loop that would load all the data in memory. The iterator, coupled with the torch DataLoader, can batch train huge datasets without crashing the memory of the machine.

In this model, the batch size is 32:

#@title Selecting a Batch Size and Creating an Iterator
# Select a batch size for training. For fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32
batch_size = 32
# Create an iterator of our data with torch DataLoader. This helps save on memory during training because, unlike a for loop, 
# with an iterator the entire dataset does not need to be loaded into memory
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
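
The idea behind the sampler and the dataloader can be sketched with a Python generator. This is a simplified illustration, not the torch implementation: `iterate_batches` is a hypothetical helper, and the dataset is a list of placeholder tuples.

```python
import random

def iterate_batches(dataset, batch_size, shuffle, seed=0):
    """Yield one batch at a time instead of materializing all batches at once."""
    order = list(range(len(dataset)))
    if shuffle:
        # Random order for training; sequential order for validation
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [dataset[i] for i in order[start:start + batch_size]]

# 70 placeholder samples of the form (input_ids, attention_mask, label)
dataset = [(f"ids-{i}", f"mask-{i}", i % 2) for i in range(70)]

batch_sizes = [len(batch) for batch in iterate_batches(dataset, 32, shuffle=True)]
print(batch_sizes)  # [32, 32, 6]
```

The last batch is smaller than the others when the dataset size is not a multiple of the batch size.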

The data has been processed and is all set. The program can now load and configure the BERT model.

BERT model configuration

The program now initializes a BERT uncased configuration:

#@title BERT Model Configuration
# Install transformers only if it is missing (the "!" shell escape is Colab/IPython syntax)
try:
  import transformers
except ImportError:
  print("Installing transformers")
  !pip -qq install transformers

from transformers import BertModel, BertConfig
# Initializing a BERT bert-base-uncased style configuration
configuration = BertConfig()
# Initializing a model from the bert-base-uncased style configuration
model = BertModel(configuration)
# Accessing the model configuration
configuration = model.config
print(configuration)

The output displays the main Hugging Face parameters similar to the following (the library is often updated):

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

Let's go through these main parameters:

attention_probs_dropout_prob: 0.1 applies a 0.1 dropout ratio to the attention probabilities.
hidden_act: "gelu" is a non-linear activation function in the encoder. It is a Gaussian Error Linear Units activation function. The input is weighted by its magnitude, which makes it non-linear.
hidden_dropout_prob: 0.1 is the dropout probability applied to the fully connected layers. Full connections can be found in the embeddings, encoder, and pooler layers. The pooler is there to convert the sequence tensor for classification tasks, which require a fixed dimension to represent the sequence. The pooler will thus convert the sequence tensor to (batch size, hidden size), which are fixed parameters.
hidden_size: 768 is the dimension of the encoded layers and also the pooler layer.
initializer_range: 0.02 is the standard deviation value when initializing the weight matrices.
intermediate_size: 3072 is the dimension of the feed-forward layer of the encoder.
layer_norm_eps: 1e-12 is the epsilon value for layer normalization layers.
max_position_embeddings: 512 is the maximum length the model uses.
model_type: "bert" is the name of the model.
num_attention_heads: 12 is the number of heads.
num_hidden_layers: 12 is the number of layers.
pad_token_id: 0 is the ID of the padding token to avoid training padding tokens.
type_vocab_size: 2 is the size of the token_type_ids vocabulary, which identifies the sequences. For example, "the dog [SEP] the cat [SEP]" can be represented with 6 token type IDs: [0,0,0, 1,1,1].
vocab_size: 30522 is the number of different tokens used by the model to represent the input_ids.
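
These parameters determine the size of the model. As a rough check, the following sketch counts the trainable weights implied by the configuration, assuming that each linear layer has a bias and that each layer normalization has a scale and a shift vector. The helper `linear` and the breakdown into blocks are illustrative.

```python
cfg = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "intermediate_size": 3072,
    "num_hidden_layers": 12,
}
H, I = cfg["hidden_size"], cfg["intermediate_size"]

def linear(n_in, n_out):
    """Number of weights in a fully connected layer with a bias."""
    return n_in * n_out + n_out

layer_norm = 2 * H  # scale and shift vectors

# Word, position, and token type embedding tables, plus one layer normalization
embeddings = (cfg["vocab_size"] + cfg["max_position_embeddings"]
              + cfg["type_vocab_size"]) * H + layer_norm
# Query, key, value, and output projections, plus one layer normalization
attention = 4 * linear(H, H) + layer_norm
# Two feed-forward projections (768 -> 3072 -> 768), plus one layer normalization
feed_forward = linear(H, I) + linear(I, H) + layer_norm
encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
pooler = linear(H, H)

total = embeddings + encoder + pooler
print(f"{total:,}")  # 109,482,240
```

The result, about 110 million parameters, is in line with the size usually quoted for BERT_BASE. The classification head added for fine-tuning is not included in this count.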

With these parameters in mind, we can load the pretrained model.

Loading the Hugging Face BERT uncased base model

The program now loads the pretrained BERT model:

#@title Loading Hugging Face Bert uncased base model 
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.cuda()

This pretrained model can be trained further if necessary. It is interesting to explore the architecture in detail to visualize the parameters of each sub-layer as shown in the following excerpt:

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )

Let's now go through the main parameters of the optimizer.

Optimizer grouped parameters

The program will now initialize the optimizer for the model's parameters. Fine-tuning a model begins with initializing the pretrained model parameter values (not their names).

The parameters of the optimizer include a weight decay rate to avoid overfitting, and some parameters are filtered.

The goal is to prepare the model's parameters for the training loop:

#@title Optimizer Grouped Parameters
#This code is taken from:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L102
# Don't apply weight decay to any parameters whose names include these tokens.
# (Here, BERT has no 'gamma' or 'beta' parameters; the layer normalization scale is named 'LayerNorm.weight')
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']
# Separate the parameters that receive weight decay from those that do not.
# - For the decayed parameters, this specifies a 'weight_decay' of 0.01.
# - For the 'bias' and 'LayerNorm.weight' parameters, the 'weight_decay' is 0.0.
# The key must be named 'weight_decay'; a key such as 'weight_decay_rate' would be silently ignored by the optimizer.
optimizer_grouped_parameters = [
    # Filter for all parameters whose names *don't* include 'bias' or 'LayerNorm.weight'.
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    
    # Filter for parameters which *do* include those.
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
# Note - 'optimizer_grouped_parameters' only includes the parameter values, not 
# the names.

The parameters have been prepared and cleaned up. They are ready for the training loop.
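
The filtering by name can be tried on a handful of parameter names without loading a model. In the sketch below, the names follow the pattern of the architecture printed earlier, and the string values are placeholders standing in for tensors.

```python
# (name, value) pairs standing in for model.named_parameters()
named_parameters = [
    ("bert.embeddings.word_embeddings.weight", "W_emb"),
    ("bert.embeddings.LayerNorm.weight", "ln_w"),
    ("bert.embeddings.LayerNorm.bias", "ln_b"),
    ("bert.encoder.layer.0.attention.self.query.weight", "W_q"),
    ("bert.encoder.layer.0.attention.self.query.bias", "b_q"),
    ("classifier.weight", "W_cls"),
    ("classifier.bias", "b_cls"),
]
no_decay = ["bias", "LayerNorm.weight"]

# A parameter is excluded from weight decay if its name contains any no_decay token
decay_group = [n for n, _ in named_parameters
               if not any(nd in n for nd in no_decay)]
no_decay_group = [n for n, _ in named_parameters
                  if any(nd in n for nd in no_decay)]

print("weight decay applied:", decay_group)
print("no weight decay:     ", no_decay_group)
```

Only the weight matrices of the embeddings, the linear layers, and the classifier are regularized; the biases and the layer normalization parameters are left out.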

The hyperparameters for the training loop

The hyperparameters for the training loop are critical, though they seem innocuous. Adam will activate weight decay and also go through a warm-up phase, for example.

The learning rate (lr) should start at a very small value early in the optimization phase and increase gradually over an initial number of iterations. The warm-up rate (warmup) sets the proportion of the training steps devoted to this ramp-up. This avoids large gradients and overshooting the optimization goals.
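
The warm-up idea can be sketched with a few lines of plain Python. This is a simplified illustration, assuming a linear ramp-up followed by a linear decay; it is not necessarily the exact schedule that BertAdam applies. The function name warmup_linear_lr and the value of total_steps are hypothetical.

```python
def warmup_linear_lr(step, total_steps, base_lr=2e-5, warmup=0.1):
    """Learning rate at a given step: linear warm-up, then linear decay.

    Simplified illustration; not necessarily BertAdam's exact schedule.
    """
    progress = step / total_steps
    if progress < warmup:
        # Warm-up phase: ramp from 0 up to base_lr
        return base_lr * progress / warmup
    # Decay phase: ramp from base_lr down to 0 at the final step
    return base_lr * (1.0 - progress) / (1.0 - warmup)


total_steps = 1000  # hypothetical number of optimization steps
for step in (0, 50, 100, 550, 1000):
    print(step, warmup_linear_lr(step, total_steps))
```

With warmup=0.1, the learning rate reaches its peak value of lr after the first 10% of the steps, so the earliest and noisiest updates are taken with small steps.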

Some researchers argue that the gradients at the output level of the sub-layers before layer normalization do not require a warm-up rate. Solving this problem requires many experimental runs.

The optimizer is a BERT version of Adam called BertAdam:

#@title The Hyperparameters for the Training Loop 
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=2e-5,
                     warmup=.1)

The program adds an accuracy measurement function to compare the predictions to the labels:

#Creating the Accuracy Measurement Function
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
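
As a quick check of what flat_accuracy computes, the following dependency-free sketch reproduces the same logic (argmax over each row of logits, then the fraction of matches) on hypothetical values. The helper names and the toy logits are assumptions made for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical logits for four sentences and two classes (0 and 1)
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.1, 0.4]]
labels = [0, 1, 0, 0]

print(accuracy(logits, labels))  # 0.75
```

The third sentence is predicted as class 1 while its label is 0, so three predictions out of four are correct.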

The data is ready, and so are the parameters. It's time to activate the training loop!

The training loop

The training loop follows standard learning processes. The number of epochs is set to 4, and the loss and accuracy are measured so that they can be plotted. The training loop uses the dataloader to load and train batches. The training process is measured and evaluated.

The code starts by initializing train_loss_set, which will store the loss and accuracy values to be plotted. It then trains over its epochs with a standard training loop, as shown in the following excerpt:

#@title The Training Loop
t = [] 
# Store our loss and accuracy for plotting
train_loss_set = []
# Number of training epochs (authors recommend between 2 and 4)
epochs = 4
# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):
…./…
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1
  print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))

The output displays the information for each epoch with the trange wrapper, for _ in trange(epochs, desc="Epoch"):

***output***
Epoch:   0%|          | 0/4 [00:00<?, ?it/s]
Train loss: 0.5381132976395461
Epoch:  25%|██▌       | 1/4 [07:54<23:43, 474.47s/it]
Validation Accuracy: 0.788966049382716
Train loss: 0.315329696132929
Epoch:  50%|█████     | 2/4 [15:49<15:49, 474.55s/it]
Validation Accuracy: 0.836033950617284
Train loss: 0.1474070605354314
Epoch:  75%|███████▌  | 3/4 [23:43<07:54, 474.53s/it]
Validation Accuracy: 0.814429012345679
Train loss: 0.07655430570461196
Epoch: 100%|██████████| 4/4 [31:38<00:00, 474.58s/it]
Validation Accuracy: 0.810570987654321

Transformer models are evolving very quickly and deprecation messages and even errors might occur. Hugging Face is no exception to this and we must update our code accordingly when this happens.

The model is trained. We can now display the training evaluation.

Training evaluation

The loss and accuracy values were stored in train_loss_set as defined at the beginning of the training loop.

The program now plots the measurements:

#@title Training Evaluation
plt.figure(figsize=(15,8))
plt.title("Training loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.plot(train_loss_set)
plt.show()

The output is a graph of the training loss per batch. Its downward trend shows that the training process went well and was efficient:

Figure 2.6: Training loss per batch

The model has been fine-tuned. We can now run predictions.

Predicting and evaluating using the holdout dataset

The BERT downstream model was trained with the in_domain_train.tsv dataset. The program will now make predictions using the holdout (testing) dataset contained in the out_of_domain_dev.tsv file. The goal is to predict whether the sentence is grammatically correct.

The following excerpt of the code shows that the data preparation process applied to the training data is repeated in the part of the code for the holdout dataset:

#@title Predicting and Evaluating Using the Holdout Dataset 
df = pd.read_csv("out_of_domain_dev.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
# Create sentence and label lists
sentences = df.sentence.values
# We need to add special tokens at the beginning and end of each sentence for BERT to work properly
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = df.label.values
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
.../...

The program then runs batch predictions using the dataloader:

# Predict 
for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  # Telling the model not to compute or store gradients, saving memory and speeding up prediction
  with torch.no_grad():
    # Forward pass, calculate logit predictions
    logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

The logits and labels of the predictions are moved to the CPU:

  # Move logits and labels to CPU
  logits =  logits['logits'].detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()

The predictions and their true labels are stored:

  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

The program can now evaluate the predictions.

Evaluating using Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) was initially designed to measure the quality of binary classifications and can be modified to be a multi-class correlation coefficient. Each prediction of a two-class classifier falls into one of four outcomes:

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative

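Brian W. Matthews, a biochemist, designed it in 1975, inspired by his predecessors' phi function. Since then, it has evolved into various formats such as the following one:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$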

The value produced by MCC is between -1 and +1. A value of +1 indicates a perfect prediction, -1 indicates an inverse prediction (every prediction disagrees with its label), and 0 indicates a prediction that is no better than random.
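
The three reference values can be checked with a minimal sketch that computes MCC from the four counts, using the standard form of the coefficient: TP×TN − FP×FN divided by the square root of (TP+FP)(TP+FN)(TN+FP)(TN+FN). The function name mcc and the counts are hypothetical, and returning 0 when the denominator is 0 is a common convention adopted here as an assumption.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): return 0 when the denominator is 0
    return numerator / denominator if denominator else 0.0


print(mcc(tp=50, tn=50, fp=0, fn=0))    # 1.0  (perfect prediction)
print(mcc(tp=0, tn=0, fp=50, fn=50))    # -1.0 (inverse prediction)
print(mcc(tp=25, tn=25, fp=25, fn=25))  # 0.0  (no better than random)
```

Unlike plain accuracy, the coefficient uses all four counts, which makes it informative when the two classes are unbalanced.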

GLUE evaluates Linguistic Acceptability with MCC.

MCC is imported from sklearn.metrics:

#@title Evaluating Using Matthews Correlation Coefficient
# Import and evaluate each test batch using the Matthews correlation coefficient
from sklearn.metrics import matthews_corrcoef

A list is created to store the MCC score of each batch:

matthews_set = []

The MCC value is calculated and stored in matthews_set:

for i in range(len(true_labels)):
  matthews = matthews_corrcoef(true_labels[i],
                 np.argmax(predictions[i], axis=1).flatten())
  matthews_set.append(matthews)

You may see messages due to library and module version changes. The final score will be based on the entire test set, but let's take a look at the scores on the individual batches to get a sense of the variability in the metric between batches.

The score of individual batches

Let's view the score of the individual batches:

#@title Score of Individual Batches
matthews_set

The output produces MCC values between -1 and +1 as expected:

[0.049286405809014416,
 -0.2548235957188128,
 0.4732058754737091,
 0.30508307783296046,
 0.3567530340063379,
 0.8050112948805689,
 0.23329882422520506,
 0.47519096331149147,
 0.4364357804719848,
 0.4700159919404217,
 0.7679476477883045,
 0.8320502943378436,
 0.5807564950208268,
 0.5897435897435898,
 0.38461538461538464,
 0.5716350506349809,
 0.0]

Almost all the MCC values are positive, which is good news. Let's see what the evaluation is for the whole dataset.

Matthews evaluation for the whole dataset

The MCC is a practical way to evaluate a classification model.

The program will now aggregate the true values for the whole dataset:

#@title Matthews Evaluation on the Whole Dataset
# Flatten the predictions and true values for an aggregate Matthews evaluation on the whole dataset
flat_predictions = [item for sublist in predictions for item in sublist]
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]
matthews_corrcoef(flat_true_labels, flat_predictions)
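
The difference between per-batch scores and the score on the flattened dataset can be illustrated with a small, dependency-free sketch. The two toy batches, the helper name mcc_from_labels, and the convention of returning 0 for a zero denominator are assumptions made for illustration; they are not the values or the implementation used by the program.

```python
import math


def mcc_from_labels(y_true, y_pred):
    """Binary MCC computed from two lists of 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): return 0 when the denominator is 0
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Two hypothetical batches of true labels and predicted labels
true_batches = [[1, 1, 0, 0], [1, 0, 1, 1]]
pred_batches = [[1, 1, 0, 0], [1, 1, 1, 1]]

# Score each batch separately
batch_scores = [mcc_from_labels(t, p)
                for t, p in zip(true_batches, pred_batches)]
print(batch_scores)  # [1.0, 0.0]

# Flatten the batches, then score the whole dataset once
flat_true = [y for batch in true_batches for y in batch]
flat_pred = [y for batch in pred_batches for y in batch]
whole_score = mcc_from_labels(flat_true, flat_pred)
print(whole_score)
```

In this toy case, the second batch scores 0 because every prediction in it is 1, while the score on the flattened data is roughly 0.745, well above the 0.5 average of the batch scores. Small batches make the metric unstable, which is why the final evaluation is run on the whole dataset.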