Transformers for Natural Language Processing

Transformers for Natural Language Processing: Build innovative deep neural network architectures for NLP with Python, PyTorch, TensorFlow, BERT, RoBERTa, and more



Fine-Tuning BERT Models

In Chapter 1, Getting Started with the Model Architecture of the Transformer, we defined the building blocks of the architecture of the original Transformer. Think of the original Transformer as a model built with LEGO® bricks. The construction set contains bricks such as encoders, decoders, embedding layers, positional encoding methods, multi-head attention layers, masked multi-head attention layers, post-layer normalization, feed-forward sub-layers, and linear output layers. The bricks come in various sizes and forms. You can spend hours building all sorts of models using the same building kit! Some constructions will only require some of the bricks. Other constructions will add a new piece, just like when we obtain additional bricks for a model built using LEGO® components.

BERT added a new piece to the Transformer building kit: a bidirectional multi-head attention sub-layer. When we humans are having problems understanding a sentence, we do not just look at the past words. BERT, like us, looks at all the words in the same sentence at the same time.

In this chapter, we will first explore the architecture of Bidirectional Encoder Representations from Transformers (BERT). BERT only uses the blocks of the encoders of the Transformer in a novel way and does not use the decoder stack.

Then we will fine-tune a pretrained BERT model that was trained by a third party and uploaded to Hugging Face. Because transformers can be pretrained once, a pretrained BERT can then be fine-tuned on several NLP tasks. We will go through this fascinating experience of downstream Transformer usage using Hugging Face modules.

This chapter covers the following topics:

  • Bidirectional Encoder Representations from Transformers (BERT)
  • The architecture of BERT
  • The two-step BERT framework
  • Preparing the pretraining environment
  • Defining pretraining encoder layers
  • Defining fine-tuning
  • Downstream multitasking
  • Building a fine-tuning BERT model
  • Loading an acceptability judgment dataset
  • Creating attention masks
  • BERT model configuration
  • Measuring the performance of the fine-tuned model

Our first step will be to explore the architecture of BERT.

The architecture of BERT

BERT introduces bidirectional attention to transformer models. Bidirectional attention requires many other changes to the original Transformer model.

We will not go through the building blocks of transformers described in Chapter 1, Getting Started with the Model Architecture of the Transformer. You can consult Chapter 1 at any time to review an aspect of the building blocks of transformers. In this section, we will focus on the specific aspects of BERT models.

We will focus on the evolutions designed by Devlin et al. (2018), which describe the encoder stack.

We will first go through the encoder stack, then the preparation of the pretraining input environment. Then we will describe the two-step framework of BERT: pretraining and fine-tuning.

Let's first explore the encoder stack.

The encoder stack

The first building block we will take from the original Transformer model is an encoder layer. The encoder layer as described in Chapter 1, Getting Started with the Model Architecture of the Transformer, is shown in Figure 2.1:

Figure 2.1: The encoder layer

The BERT model does not use decoder layers: it has an encoder stack but no decoder stack. The masked tokens (hiding the tokens to predict) are in the attention layers of the encoder, as we will see when we zoom into a BERT encoder layer in the following sections.

The original Transformer contains a stack of N=6 layers. The number of dimensions of the original Transformer is dmodel = 512. The number of attention heads of the original Transformer is A=8. The dimension of a head of the original Transformer is:

dk = dmodel / A = 512 / 8 = 64

BERT encoder layers are larger than the original Transformer model.

Two BERT models can be built with the encoder layers:

  • BERTBASE, which contains a stack of N=12 encoder layers. dmodel = 768 and can also be expressed as H=768, as in the BERT paper. A multi-head attention sub-layer contains A=12 heads. The dimension of each head zA remains 64, as in the original Transformer model:

    dk = dmodel / A = 768 / 12 = 64

    The output of each multi-head attention sub-layer before concatenation will be the output of the 12 heads:

    output_multi-head_attention={z0, z1, z2,…,z11}

  • BERTLARGE, which contains a stack of N=24 encoder layers. dmodel = 1024. A multi-head attention sub-layer contains A=16 heads. The dimension of each head zA also remains 64, as in the original Transformer model:

    dk = dmodel / A = 1024 / 16 = 64

    The output of each multi-head attention sub-layer before concatenation will be the output of the 16 heads:

    output_multi-head_attention={z0, z1, z2,…,z15}

The sizes of the models can be summed up as follows:

Figure 2.2: Transformer models
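The head dimension of each configuration follows from dividing dmodel by the number of heads A. The following is a minimal sketch of that arithmetic; the dictionary MODEL_SIZES and the function head_dim are illustrative names and are not part of the chapter's notebook:

```python
# Model sizes quoted in this section: (number of layers N, d_model, heads A)
MODEL_SIZES = {
    "Transformer (original)": (6, 512, 8),
    "BERT-BASE": (12, 768, 12),
    "BERT-LARGE": (24, 1024, 16),
}


def head_dim(d_model: int, num_heads: int) -> int:
    """Return the dimension of one attention head (d_model split evenly)."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // num_heads


for name, (layers, d_model, heads) in MODEL_SIZES.items():
    print(f"{name}: N={layers}, d_model={d_model}, A={heads}, "
          f"d_k={head_dim(d_model, heads)}")
```

The three models differ in depth and width, but the size of a single head stays the same.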

Size and dimensions play an essential role in BERT-style pretraining. BERT models are like humans. BERT models produce better results with more working memory (dimensions), and more knowledge (data). Large transformer models that learn large amounts of data will pretrain better for downstream NLP tasks.

Let's now go to the first sub-layer and see the fundamental aspects of input embedding and positional encoding in a BERT model.

Preparing the pretraining input environment

The BERT model has no decoder stack of layers. As such, it does not have a masked multi-head attention sub-layer. BERT goes further and states that a masked multi-head attention layer that masks the rest of the sequence impedes the attention process.

A masked multi-head attention layer masks all of the tokens that are beyond the present position. For example, take the following sentence:

The cat sat on it because it was a nice rug.

If we have just reached the word "it," the input of the encoder could be:

The cat sat on it<masked sequence>

The motivation of this approach is to prevent the model from seeing the output it is supposed to predict. This left-to-right approach produces relatively good results.

However, the model cannot learn much this way. To know what "it" refers to, we need to see the whole sentence to reach the word "rug" and figure out that "it" was the rug.

The authors of BERT came up with an idea. Why not pretrain the model to make predictions using a different approach?

The authors of BERT came up with bidirectional attention, letting an attention head attend to all of the words both from left to right and right to left. In other words, the self-attention mask of an encoder could do the job without being hindered by the masked multi-head attention sub-layer of the decoder.
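To make the difference concrete, here is a small sketch that compares what one position is allowed to attend to under a left-to-right mask and under bidirectional attention. It only builds 0/1 visibility matrices in plain Python; the function names are hypothetical and no attention scores are computed:

```python
def causal_mask(n):
    """Left-to-right mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


tokens = "the cat sat on it because it was a nice rug".split()
pos = tokens.index("it")  # position of the first "it"

# Tokens visible from the first "it" under each masking scheme
visible_causal = [t for t, m in zip(tokens, causal_mask(len(tokens))[pos]) if m]
visible_bidirectional = [
    t for t, m in zip(tokens, bidirectional_mask(len(tokens))[pos]) if m
]

print("left-to-right:", visible_causal)
print("bidirectional:", visible_bidirectional)
```

With the left-to-right mask, "rug" is hidden from "it"; with the bidirectional mask, it is visible.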

The model was trained with two tasks. The first task is Masked Language Modeling (MLM). The second task is Next Sentence Prediction (NSP).

Let's start with masked language modeling.

Masked language modeling

Masked language modeling does not require training a model with a sequence of visible words followed by a masked sequence to predict.

BERT introduces the bidirectional analysis of a sentence with a random mask on a word of the sentence.

It is important to note that BERT tokenizes its inputs with WordPiece, a sub-word segmentation method. It also uses learned positional encoding, not the sine-cosine approach.

A potential input sequence could be:

"The cat sat on it because it was a nice rug."

The decoder would mask the attention sequence after the model reached the word "it":

"The cat sat on it <masked sequence>."

But the BERT encoder masks a random token to make a prediction:

"The cat sat on it [MASK] it was a nice rug."

The multi-head attention sub-layer can now see the whole sequence, run the self-attention process, and predict the masked token.

The input tokens were masked in a tricky way to force the model to train longer but produce better results. Among the tokens selected for prediction, three methods are applied:

  • Surprise the model by leaving the token unchanged 10% of the time; for example:
    "The cat sat on it [because] it was a nice rug."
    
  • Surprise the model by replacing the token with a random token 10% of the time; for example:
    "The cat sat on it [often] it was a nice rug."
    
  • Replace the token with a [MASK] token 80% of the time; for example:
    "The cat sat on it [MASK] it was a nice rug."
    

The authors' bold approach avoids overfitting and forces the model to train efficiently.
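The three-way rule above can be sketched in a few lines of plain Python. This is an illustration, not the code used to pretrain BERT: the function mask_tokens and the toy vocabulary are hypothetical, and the 15% selection rate is the value reported in the BERT paper rather than something defined in this chapter:

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token where a prediction is required,
    and None where the position is not used for the MLM loss.
    """
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            # Position not selected: keep the token, nothing to predict
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: leave the token unchanged
    return corrupted, labels


sentence = "the cat sat on it because it was a nice rug".split()
toy_vocab = sorted(set(sentence))  # stand-in for the WordPiece vocabulary
demo_tokens, demo_labels = mask_tokens(sentence, toy_vocab, random.Random(42))
print(demo_tokens)
print(demo_labels)
```

Because some selected tokens are left unchanged or swapped for a random token, the model cannot assume that only [MASK] positions need a prediction.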

BERT was also trained to perform next sentence prediction.

Next sentence prediction

The second method used to train BERT is Next Sentence Prediction (NSP). The input contains two sentences.

Two new tokens were added:

  • [CLS] is a binary classification token added to the beginning of the first sequence to predict if the second sequence follows the first sequence. A positive sample is usually a pair of consecutive sentences taken from a dataset. A negative sample is created using sequences from different documents.
  • [SEP] is a separation token that signals the end of a sequence.

For example, the input sentences taken out of a book could be:

"The cat slept on the rug. It likes sleeping all day."

These two sentences would become one input complete sequence:

[CLS] the cat slept on the rug [SEP] it likes sleep ##ing all day [SEP]

This approach requires additional encoding information to distinguish sequence A from sequence B.

If we put the whole embedding process together, we obtain:

Figure 2.3: Input embeddings

The input embeddings are obtained by summing the token embeddings, the segment (sentence, phrase, word) embeddings, and the positional encoding embeddings.

The input embedding and positional encoding sub-layer of a BERT model can be summed up as follows:

  • A sequence of words is broken down into WordPiece tokens.
  • A [MASK] token will randomly replace the initial word tokens for masked language modeling training.
  • A [CLS] classification token is inserted at the beginning of a sequence for classification purposes.
  • A [SEP] token separates two sentences (segments, phrases) for NSP training.
  • Sentence embedding is added to token embedding, so that sentence A has a different sentence embedding value than sentence B.
  • Positional encoding is learned. The sine-cosine positional encoding method of the original Transformer is not applied.
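The steps above can be sketched for the two-sentence example of this section. The function build_pair_input is a hypothetical helper, and the WordPiece split of "sleeping" is copied by hand from the example rather than produced by a tokenizer. The three lists it returns are the indices that would be looked up in the token, segment, and position embedding tables before being summed:

```python
def build_pair_input(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] with segment and position indices."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers B and its [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Positions are simply 0..n-1; BERT learns one embedding per position
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens_a = "the cat slept on the rug".split()
tokens_b = ["it", "likes", "sleep", "##ing", "all", "day"]

tokens, segment_ids, position_ids = build_pair_input(tokens_a, tokens_b)
for token, segment, position in zip(tokens, segment_ids, position_ids):
    print(f"{position:>2}  segment={segment}  {token}")
```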

Some additional key features are:

  • BERT uses bidirectional attention in all of its multi-head attention sub-layers, opening vast horizons of learning and understanding relationships between tokens.
  • BERT introduces scenarios of unsupervised embedding, pretraining models with unlabeled text. This forces the model to think harder during the multi-head attention learning process. This makes BERT able to learn how languages are built and apply this knowledge to downstream tasks without having to pretrain each time.
  • BERT also uses supervised learning, covering all bases in the pretraining process.

BERT has improved the training environment of transformers. Let's now see the motivation of pretraining and how it helps the fine-tuning process.

Pretraining and fine-tuning a BERT model

BERT is a two-step framework. The first step is the pretraining, and the second is fine-tuning, as shown in Figure 2.4:

Figure 2.4: The BERT framework

Training a transformer model can take hours, if not days. It takes quite some time to engineer the architecture and parameters, and select the proper datasets to train a transformer model.

Pretraining is the first step of the BERT framework that can be broken down into two sub-steps:

  • Defining the model's architecture: number of layers, number of heads, dimensions, and the other building blocks of the model
  • Training the model on Masked Language Modeling (MLM) and NSP tasks

The second step of the BERT framework is fine-tuning, which can also be broken down into two sub-steps:

  • Initializing the downstream model chosen with the trained parameters of the pretrained BERT model
  • Fine-tuning the parameters for specific downstream tasks such as Recognizing Textual Entailment (RTE), Question Answering (SQuAD v1.1, SQuAD v2.0), and Situations With Adversarial Generations (SWAG)

In this section, we covered the information we need to fine-tune a BERT model. In the following chapters, we will explore the topics we brought up in this section in more depth:

  • In Chapter 3, Pretraining a RoBERTa Model from Scratch, we will pretrain a BERT-like model from scratch in 15 steps. We will even compile our own data, train a tokenizer, and then train the model. The goal of this chapter is to first go through the specific building blocks of BERT and then fine-tune an existing model.
  • In Chapter 4, Downstream NLP Tasks with Transformers, we will go through many downstream NLP tasks, exploring GLUE, SQuAD v1.1, SQuAD, SWAG, BLEU, and several other NLP evaluation datasets. We will run several downstream transformer models to illustrate key tasks. The goal of this chapter is to fine-tune a downstream model.
  • In Chapter 6, Text Generation with OpenAI GPT-2 and GPT-3 Models, we will explore the architecture and usage of OpenAI GPT, GPT-2, and GPT-3 transformers. BERTBASE was configured to be close to OpenAI GPT to show that it produced better performance. However, OpenAI transformers keep evolving too! We will see how.

In this chapter, the BERT model we will fine-tune will be trained on The Corpus of Linguistic Acceptability (CoLA). The downstream task is based on Neural Network Acceptability Judgments by Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman.

We will fine-tune a BERT model that will determine the grammatical acceptability of a sentence. The fine-tuned model will have acquired a certain level of linguistic competence.

We have gone through BERT architecture and its pretraining and fine-tuning framework. Let's now fine-tune a BERT model.

Fine-tuning BERT

In this section, we will fine-tune a BERT model to predict the downstream task of Acceptability Judgements and measure the predictions with the Matthews Correlation Coefficient (MCC), which will be explained in the Evaluating using Matthews Correlation Coefficient section of this chapter.

Open BERT_Fine_Tuning_Sentence_Classification_DR.ipynb in Google Colab (make sure you have an email account). The notebook is in Chapter02 of the GitHub repository of this book.

The title of each cell in the notebook is the same as, or very close to, the title of each subsection of this chapter.

Let's start by making sure the GPU is activated.

Activating the GPU

Pretraining a multi-head attention transformer model requires the parallel processing GPUs can provide.

The program first starts by checking if the GPU is activated:

#@title Activating the GPU
# Main menu->Runtime->Change Runtime Type
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

The output should be:

Found GPU at: /device:GPU:0

The program will be using Hugging Face modules.

Installing the Hugging Face PyTorch interface for BERT

Hugging Face provides a pretrained BERT model. Hugging Face developed a base class named PreTrainedModel. By installing this class, we can load a model from a pretrained model configuration.

Hugging Face provides modules in TensorFlow and PyTorch. I recommend that a developer feels comfortable with both environments. Excellent AI research teams use either or both environments.

In this chapter, we will install the modules required as follows:

#@title Installing the Hugging Face PyTorch Interface for Bert
!pip install -q transformers

The installation will run, or requirement satisfied messages will be displayed.

We can now import the modules needed for the program.

Importing the modules

We will import the pretrained modules required, such as the pretrained BERT tokenizer and the configuration of the BERT model. The AdamW optimizer is imported along with the sequence classification module and a linear warm-up scheduler:

#@title Importing the modules
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig
from transformers import AdamW, BertForSequenceClassification, get_linear_schedule_with_warmup

A nice progress bar module is imported from tqdm:

from tqdm import tqdm, trange

We can now import the widely used standard Python modules:

import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt

No message will be displayed if all goes well, bearing in mind that Google Colab has pre-installed the modules on the VM we are using.

Specifying CUDA as the device for torch

We will now specify that torch uses the Compute Unified Device Architecture (CUDA) to put the parallel computing power of the NVIDIA card to work for our multi-head attention model:

#@title Specify CUDA as device for Torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

The VM I ran on Google Colab displayed the following output:

'Tesla P100-PCIE-16GB'

The output may vary with Google Colab configurations.

We will now load the dataset.

Loading the dataset

We will now load the CoLA dataset, based on the Warstadt et al. (2018) paper.

General Language Understanding Evaluation (GLUE) considers Linguistic Acceptability as a top-priority NLP task. In Chapter 4, Downstream NLP Tasks with Transformers, we will explore the key tasks transformers must perform to prove their efficiency.

Use the Google Colab file manager to upload in_domain_train.tsv and out_of_domain_dev.tsv, which you will find on GitHub in the Chapter02 directory of the repository of the book.

You should see them appear in the file manager:

Figure 2.5: Uploading the datasets

Now the program will load the datasets:

#@title Loading the Dataset
#source of dataset : https://nyu-mll.github.io/CoLA/
df = pd.read_csv("in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
df.shape

The output displays the shape of the dataset we have imported:

(8551, 4)

A 10-line sample is displayed to visualize the Acceptability Judgment task and see if a sequence makes sense or not:

df.sample(10)

The output shows 10 lines of the labeled dataset:

sentence_source	label	label_notes	sentence
1742	r-67		1	NaN		they said that tom would n't pay up , but pay ...
937	bc01		1	NaN		although he likes cabbage too , fred likes egg...
5655	c_13		1	NaN		wendy 's mother country is iceland .
500	bc01		0	*		john is wanted to win .
4596	ks08		1	NaN		i did n't find any bugs in my bed .
7412	sks13		1	NaN		the girl he met at the departmental party will...
8456	ad03		0	*		peter is the old pigs .
744	bc01		0	*		frank promised the men all to leave .
5420	b_73		0	*		i 've seen as much of a coward as frank .
5749	c_13		1	NaN		we drove all the way to buenos aires .

Each sample in the .tsv files contains four tab-separated columns:

  • Column 1: the source of the sentence (code)
  • Column 2: the label (0=unacceptable, 1=acceptable)
  • Column 3: the label annotated by the author
  • Column 4: the sentence to be classified

You can open the .tsv files locally to read a few samples of the dataset. The program will now process the data for the BERT model.
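If you prefer to inspect the four-column layout without pandas, the following sketch parses a few rows with the standard csv module. The rows are copied from the sample shown above; read_rows and SAMPLE_TSV are illustrative names, and an in-memory string stands in for the uploaded file:

```python
import csv
import io
from collections import Counter

# Four rows copied from the sample above, in the same tab-separated layout
# (the empty third field is what pandas displays as NaN).
SAMPLE_TSV = (
    "bc01\t0\t*\tjohn is wanted to win .\n"
    "ks08\t1\t\ti did n't find any bugs in my bed .\n"
    "ad03\t0\t*\tpeter is the old pigs .\n"
    "c_13\t1\t\twe drove all the way to buenos aires .\n"
)

COLUMNS = ["sentence_source", "label", "label_notes", "sentence"]


def read_rows(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    # QUOTE_NONE keeps quote characters inside sentences as literal text
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(COLUMNS, fields)) for fields in reader]


rows = read_rows(SAMPLE_TSV)
label_counts = Counter(int(row["label"]) for row in rows)
print(label_counts)  # number of unacceptable (0) and acceptable (1) sentences
```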

Creating sentences, label lists, and adding BERT tokens

The program will now create the sentences as described in the Preparing the pretraining input environment section of this chapter:

#@ Creating sentence, label lists and adding Bert tokens
sentences = df.sentence.values
# Adding CLS and SEP tokens at the beginning and end of each sentence for BERT
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = df.label.values

The [CLS] and [SEP] have now been added.

The program now activates the tokenizer.

Activating the BERT tokenizer

In this section, we will initialize a pretrained BERT tokenizer. This will save the time it would take to train it from scratch.

The program selects an uncased tokenizer, activates it, and displays the first tokenized sentence:

#@title Activating the BERT Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print ("Tokenize the first sentence:")
print (tokenized_texts[0])

The output contains the classification token and the sequence segmentation token:

Tokenize the first sentence:
['[CLS]', 'our', 'friends', 'wo', 'n', "'", 't', 'buy', 'this', 'analysis', ',', 'let', 'alone', 'the', 'next', 'one', 'we', 'propose', '.', '[SEP]']

The program will now process the data.

Processing the data

We need to determine a fixed maximum length and process the data for the model. The sentences in the datasets are short. To be safe, the program sets the maximum length of a sequence to 128, and the sequences are padded:

#@title Processing the data
# Set the maximum sequence length. The longest sequence in our training set is 47, but we'll leave room on the end anyway. 
# In the original paper, the authors used a length of 512.
MAX_LEN = 128
# Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
# Pad our input tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
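To see what post-truncation and post-padding do to a single sequence, here is a pure-Python sketch of the same idea. The helper pad_or_truncate is hypothetical and is not the Keras implementation; the token IDs are made-up values for illustration:

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Cut a sequence at max_len, then right-pad it with pad_id up to max_len."""
    ids = list(ids)[:max_len]                     # truncating="post"
    return ids + [pad_id] * (max_len - len(ids))  # padding="post"


short_sequence = [101, 2256, 102]           # illustrative IDs only
long_sequence = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_sequence, 5))   # shorter than max_len: padded
print(pad_or_truncate(long_sequence, 5))    # longer than max_len: truncated
```

Every sequence leaves this step with exactly max_len entries, which is what lets the data be stacked into one tensor later.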

The sequences have been processed and now the program creates the attention masks.

Creating attention masks

Now comes a tricky part of the process. We padded the sequences in the previous cell. But we want to prevent the model from performing attention on those padded tokens!

The idea is to apply a mask with a value of 1 for each token, which will be followed by 0s for padding:

#@title Create attention masks
attention_masks = []
# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)
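Here is the same masking rule applied to two toy padded sequences, so the result can be read at a glance. The token IDs are made-up values; the rule works because the padding ID is 0 and every real token has an ID greater than 0:

```python
# Two toy sequences already padded to length 5 with the padding ID 0
padded_input_ids = [
    [101, 7592, 102, 0, 0],
    [101, 102, 0, 0, 0],
]

# 1.0 marks a real token, 0.0 marks padding that attention should ignore
toy_attention_masks = [
    [float(token_id > 0) for token_id in seq] for seq in padded_input_ids
]

for seq, mask in zip(padded_input_ids, toy_attention_masks):
    print(seq, "->", mask)
```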

The program will now split the data.

Splitting data into training and validation sets

The program now performs the standard process of splitting the data into training and validation sets:

#@title Splitting data into train and validation sets
# Use train_test_split to split our data into train and validation sets for training
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels, random_state=2018, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids,random_state=2018, test_size=0.1)

The data is ready for training, but it still needs to be adapted to torch.

Converting all the data into torch tensors

The fine-tuning model uses torch tensors. The program must convert the data into torch tensors:

#@title Converting all the data into torch tensors
# Torch tensors are the required datatype for our model
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

The conversion is over. Now we need to create an iterator.

Selecting a batch size and creating an iterator

In this cell, the program selects a batch size and creates an iterator. The iterator is a clever way of avoiding a loop that would load all the data in memory. The iterator, coupled with the torch DataLoader, can batch train huge datasets without crashing the memory of the machine.

In this model, the batch size is 32:

#@title Selecting a Batch Size and Creating an Iterator
# Select a batch size for training. For fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32
batch_size = 32
# Create an iterator of our data with torch DataLoader. This helps save on memory during training because, unlike a for loop, 
# with an iterator the entire dataset does not need to be loaded into memory
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
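The idea of handing the model one batch at a time can be sketched with a plain Python generator. This is not how torch's DataLoader is implemented; batch_iterator is a hypothetical helper that only mimics the two behaviors used above, shuffled order for training and sequential order for validation:

```python
import random


def batch_iterator(examples, batch_size, shuffle, rng=None):
    """Yield lists of at most batch_size examples, one batch at a time."""
    indices = list(range(len(examples)))
    if shuffle:
        # Comparable in spirit to RandomSampler: visit examples in random order
        (rng or random).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


toy_examples = list(range(70))

# Sequential order, as for the validation set
for batch in batch_iterator(toy_examples, batch_size=32, shuffle=False):
    print(len(batch), batch[:3])
```

The last batch is smaller than the others when the dataset size is not a multiple of the batch size.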

The data has been processed and is all set. The program can now load and configure the BERT model.

BERT model configuration

The program now initializes a BERT uncased configuration:

#@title BERT Model Configuration
# Install transformers only if it is not already available
try:
  import transformers
except ImportError:
  print("Installing transformers")
  !pip -qq install transformers

from transformers import BertModel, BertConfig
# Initializing a BERT bert-base-uncased style configuration
configuration = BertConfig()
# Initializing a model (with random weights) from the bert-base-uncased style configuration
model = BertModel(configuration)
# Accessing the model configuration
configuration = model.config
print(configuration)

The output displays the main Hugging Face parameters similar to the following (the library is often updated):

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

Let's go through these main parameters:

  • attention_probs_dropout_prob: 0.1 applies a 0.1 dropout ratio to the attention probabilities.
  • hidden_act: "gelu" is a non-linear activation function in the encoder. It is a Gaussian Error Linear Units activation function. The input is weighted by its magnitude, which makes it non-linear.
  • hidden_dropout_prob: 0.1 is the dropout probability applied to the fully connected layers. Full connections can be found in the embeddings, encoder, and pooler layers. The pooler is there to convert the sequence tensor for classification tasks, which require a fixed dimension to represent the sequence. The pooler will thus convert the sequence tensor to (batch size, hidden size), which are fixed parameters.
  • hidden_size: 768 is the dimension of the encoded layers and also the pooler layer.
  • initializer_range: 0.02 is the standard deviation value when initializing the weight matrices.
  • intermediate_size: 3072 is the dimension of the feed-forward layer of the encoder.
  • layer_norm_eps: 1e-12 is the epsilon value for layer normalization layers.
  • max_position_embeddings: 512 is the maximum length the model uses.
  • model_type: "bert" is the name of the model.
  • num_attention_heads: 12 is the number of heads.
  • num_hidden_layers: 12 is the number of layers.
  • pad_token_id: 0 is the ID of the padding token to avoid training padding tokens.
  • type_vocab_size: 2 is the size of the token_type_ids, which identify the sequences. For example, "the dog[SEP] The cat.[SEP]" can be represented with 6 token IDs: [0,0,0, 1,1,1].
  • vocab_size: 30522 is the number of different tokens used by the model to represent the input_ids.
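As an illustration of the hidden_act entry, the following sketch evaluates the exact, erf-based form of GELU with the standard math module. The function name gelu is ours; the model itself uses the library's tensor implementation:

```python
import math


def gelu(x: float) -> float:
    """Gaussian Error Linear Unit: x scaled by the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"gelu({value:+.1f}) = {gelu(value):+.4f}")
```

Large positive inputs pass through almost unchanged, large negative inputs are pushed toward zero, and the transition between the two is smooth, which is what makes the function non-linear.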

With these parameters in mind, we can load the pretrained model.

Loading the Hugging Face BERT uncased base model

The program now loads the pretrained BERT model:

#@title Loading Hugging Face Bert uncased base model 
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.cuda()

This pretrained model can be trained further if necessary. It is interesting to explore the architecture in detail to visualize the parameters of each sub-layer as shown in the following excerpt:

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
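The shapes in this excerpt are enough to count the trainable weights of the embeddings and of one encoder layer. The sketch below is back-of-the-envelope arithmetic based only on the printed shapes; it assumes that each Linear layer has a bias and that each LayerNorm holds one weight and one bias vector of the hidden size. The helper names are illustrative:

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def layer_norm_params(hidden: int) -> int:
    """One weight vector and one bias vector of size hidden."""
    return 2 * hidden


hidden, intermediate = 768, 3072
vocab, positions, segments = 30522, 512, 2

# word + position + token-type embedding tables, then LayerNorm
embedding_params = (vocab + positions + segments) * hidden + layer_norm_params(hidden)

# query, key, value and the attention output projection, then LayerNorm
attention_params = 4 * linear_params(hidden, hidden) + layer_norm_params(hidden)

# intermediate (768 -> 3072) and output (3072 -> 768) projections, then LayerNorm
feed_forward_params = (
    linear_params(hidden, intermediate)
    + linear_params(intermediate, hidden)
    + layer_norm_params(hidden)
)

encoder_layer_params = attention_params + feed_forward_params

print(f"embeddings:        {embedding_params:,}")
print(f"one encoder layer: {encoder_layer_params:,}")
print(f"12 encoder layers: {12 * encoder_layer_params:,}")
```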

Let's now go through the main parameters of the optimizer.

Optimizer grouped parameters

The program will now initialize the optimizer for the model's parameters. Fine-tuning a model begins with initializing the pretrained model parameter values (not their names).

The parameters of the optimizer include a weight decay rate to avoid overfitting, and some parameters are filtered.

The goal is to prepare the model's parameters for the training loop:

#@title Optimizer Grouped Parameters
#This code is taken from:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L102
# Don't apply weight decay to any parameters whose names include these tokens.
# (Here, the BERT model doesn't have 'gamma' or 'beta' parameters, only 'bias' terms)
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']
# Separate the parameters that receive weight decay from those that do not.
# - Parameters whose names match no entry in 'no_decay' get a 'weight_decay_rate' of 0.1.
# - Parameters whose names match an entry in 'no_decay' get a 'weight_decay_rate' of 0.0.
optimizer_grouped_parameters = [
    # Filter for all parameters whose names *don't* include 'bias' or 'LayerNorm.weight'.
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.1},
    
    # Filter for parameters which *do* include those.
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]
# Note - 'optimizer_grouped_parameters' only includes the parameter values, not 
# the names.
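
To make the name-based filtering concrete, here is a minimal, dependency-free sketch of the same split. The parameter names and the string placeholders standing in for tensors are hypothetical, as is the helper name split_by_decay; only the no_decay list and the two filters mirror the code above.

```python
# Hypothetical (name, value) pairs standing in for model.named_parameters().
# Real entries would be tensors; strings keep the sketch runnable anywhere.
param_optimizer = [
    ("encoder.layer.0.attention.self.query.weight", "W_q"),
    ("encoder.layer.0.attention.self.query.bias", "b_q"),
    ("encoder.layer.0.attention.output.LayerNorm.weight", "ln_w"),
    ("encoder.layer.0.attention.output.LayerNorm.bias", "ln_b"),
    ("classifier.weight", "W_cls"),
]
no_decay = ["bias", "LayerNorm.weight"]


def split_by_decay(named_params, no_decay_tokens):
    """Return (decayed, not_decayed) values using substring matching on names."""
    decayed = [p for n, p in named_params
               if not any(nd in n for nd in no_decay_tokens)]
    not_decayed = [p for n, p in named_params
                   if any(nd in n for nd in no_decay_tokens)]
    return decayed, not_decayed


decayed, not_decayed = split_by_decay(param_optimizer, no_decay)
print(decayed)      # ['W_q', 'W_cls']
print(not_decayed)  # ['b_q', 'ln_w', 'ln_b']
```

Because the test is a substring match, 'LayerNorm.bias' lands in the no-decay group through the 'bias' token, while 'LayerNorm.weight' needs its own entry in no_decay.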

The parameters have been prepared and cleaned up. They are ready for the training loop.

The hyperparameters for the training loop

The hyperparameters for the training loop are critical, though they seem innocuous. Adam will activate weight decay and also go through a warm-up phase, for example.

The learning rate (lr) should start at a very small value early in the optimization phase and gradually increase over a certain number of iterations. The warm-up rate (warmup) sets the proportion of training devoted to this ramp. This avoids large gradients and overshooting the optimization goals.

Some researchers argue that the gradients at the output level of the sub-layers before layer normalization do not require a warm-up rate. Solving this problem requires many experimental runs.
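
The warm-up idea described above can be sketched in a few lines. This is an illustrative schedule under stated assumptions, not BertAdam's actual implementation: the function name warmup_linear_lr, the total_steps value, and the linear decay after the ramp are all hypothetical choices. Only lr=2e-5 and warmup=0.1 come from the chapter.

```python
def warmup_linear_lr(step, total_steps, base_lr=2e-5, warmup=0.1):
    """Learning rate at a given step: linear ramp-up, then linear decay to zero.

    Illustrative only; the decay shape after the ramp is an assumption.
    """
    progress = step / total_steps
    if progress < warmup:
        # Ramp from 0 up to base_lr over the first `warmup` fraction of training.
        return base_lr * progress / warmup
    # Decay linearly from base_lr down to 0 over the remaining steps.
    return base_lr * max(0.0, (1.0 - progress) / (1.0 - warmup))


total_steps = 1000  # hypothetical number of optimization steps
for step in (0, 50, 100, 550, 1000):
    print(step, warmup_linear_lr(step, total_steps))
```

With these settings, the learning rate is zero at step 0, reaches its peak of 2e-5 at step 100 (10% of training), and falls back to zero at the final step.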

The optimizer is a BERT version of Adam called BertAdam:

#@title The Hyperparameters for the Training Loop 
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=2e-5,
                     warmup=.1)

The program adds an accuracy measurement function to compare the predictions to the labels:

#Creating the Accuracy Measurement Function
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
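
As a worked example of what this function measures, the following sketch reproduces the same logic without NumPy. The name flat_accuracy_py and the logits and labels are hypothetical values chosen for illustration, not data from the chapter.

```python
def flat_accuracy_py(preds, labels):
    """Fraction of rows whose highest-scoring class index equals the label."""
    # argmax over each row of scores, mirroring np.argmax(preds, axis=1)
    pred_classes = [max(range(len(row)), key=row.__getitem__) for row in preds]
    correct = sum(int(p == y) for p, y in zip(pred_classes, labels))
    return correct / len(labels)


# Hypothetical logits for four sentences and two classes.
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.4, 0.3]]
labels = [0, 1, 0, 1]
print(flat_accuracy_py(logits, labels))  # 0.5
```

The predicted classes are 0, 1, 1, 0, so two of the four predictions match the labels.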

The data is ready, the parameters are ready. It's time to activate the training loop!

The training loop

The training loop follows standard learning processes. The number of epochs is set to 4, and the loss and accuracy are measured so that they can be plotted. The training loop uses the dataloader to load and train batches. The training process is measured and evaluated.

The code starts by initializing train_loss_set, which will store the loss and accuracy for plotting. It then runs a standard training loop over the epochs, as shown in the following excerpt:

#@title The Training Loop
t = [] 
# Store our loss and accuracy for plotting
train_loss_set = []
# Number of training epochs (authors recommend between 2 and 4)
epochs = 4
# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):
…./…
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1
  print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))

The output displays the information for each epoch with the trange wrapper, for _ in trange(epochs, desc="Epoch"):

***output***
Epoch:   0%|          | 0/4 [00:00<?, ?it/s]
Train loss: 0.5381132976395461
Epoch:  25%|██▌       | 1/4 [07:54<23:43, 474.47s/it]
Validation Accuracy: 0.788966049382716
Train loss: 0.315329696132929
Epoch:  50%|█████     | 2/4 [15:49<15:49, 474.55s/it]
Validation Accuracy: 0.836033950617284
Train loss: 0.1474070605354314
Epoch:  75%|███████▌  | 3/4 [23:43<07:54, 474.53s/it]
Validation Accuracy: 0.814429012345679
Train loss: 0.07655430570461196
Epoch: 100%|██████████| 4/4 [31:38<00:00, 474.58s/it]
Validation Accuracy: 0.810570987654321

Transformer libraries evolve very quickly, so deprecation messages and even errors might occur. Hugging Face is no exception, and we must update our code accordingly when this happens.

The model is trained. We can now display the training evaluation.

Training evaluation

The loss and accuracy values were stored in train_loss_set as defined at the beginning of the training loop.

The program now plots the measurements:

#@title Training Evaluation
plt.figure(figsize=(15,8))
plt.title("Training loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.plot(train_loss_set)
plt.show()

The output is a graph that shows that the training process went well and was efficient:

Figure 2.6: Training loss per batch

The model has been fine-tuned. We can now run predictions.

Predicting and evaluating using the holdout dataset

The BERT downstream model was trained with the in_domain_train.tsv dataset. The program will now make predictions using the holdout (testing) dataset contained in the out_of_domain_dev.tsv file. The goal is to predict whether the sentence is grammatically correct.

The following excerpt of the code shows that the data preparation process applied to the training data is repeated in the part of the code for the holdout dataset:

#@title Predicting and Evaluating Using the Holdout Dataset 
df = pd.read_csv("out_of_domain_dev.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
# Create sentence and label lists
sentences = df.sentence.values
# We need to add special tokens at the beginning and end of each sentence for BERT to work properly
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = df.label.values
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
.../...

The program then runs batch predictions using the dataloader:

# Predict 
for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  # Telling the model not to compute or store gradients, saving memory and speeding up prediction
  with torch.no_grad():
    # Forward pass, calculate logit predictions
    logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

The logits and labels of the predictions are moved to the CPU:

  # Move logits and labels to CPU
  logits = logits['logits'].detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()

The predictions and their true labels are stored:

  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

The program can now evaluate the predictions.

Evaluating using Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) was initially designed to measure the quality of binary classifications and can be modified to be a multi-class correlation coefficient. In a two-class classification, each prediction falls into one of four outcomes:

  • TP = True Positive
  • TN = True Negative
  • FP = False Positive
  • FN = False Negative

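Brian W. Matthews, a biochemist, designed it in 1975, inspired by his predecessors' phi function. Since then, it has evolved into various formats, such as the following one:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$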

The value produced by MCC is between -1 and +1. A value of +1 indicates a perfect prediction, -1 indicates a completely inverse prediction, and 0 indicates a prediction no better than random.
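
The three reference values can be checked with a short sketch that computes MCC directly from the four counts, using the standard definition (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)). The function name mcc_from_counts and the counts are hypothetical, and returning 0.0 for a zero denominator is a common convention assumed here.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


print(mcc_from_counts(tp=50, tn=50, fp=0, fn=0))    # 1.0  (perfect prediction)
print(mcc_from_counts(tp=0, tn=0, fp=50, fn=50))    # -1.0 (inverse prediction)
print(mcc_from_counts(tp=25, tn=25, fp=25, fn=25))  # 0.0  (random-like prediction)
```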

GLUE evaluates Linguistic Acceptability with MCC.

MCC is imported from sklearn.metrics:

#@title Evaluating Using Matthews Correlation Coefficient
# Import and evaluate each test batch using the Matthews correlation coefficient
from sklearn.metrics import matthews_corrcoef

A list is created to store the MCC score of each batch:

matthews_set = []

The MCC value is calculated and stored in matthews_set:

for i in range(len(true_labels)):
  matthews = matthews_corrcoef(true_labels[i],
                 np.argmax(predictions[i], axis=1).flatten())
  matthews_set.append(matthews)

You may see messages due to library and module version changes. The final score will be based on the entire test set, but let's take a look at the scores on the individual batches to get a sense of the variability in the metric between batches.

The score of individual batches

Let's view the score of the individual batches:

#@title Score of Individual Batches
matthews_set

The output produces MCC values between -1 and +1 as expected:

[0.049286405809014416,
 -0.2548235957188128,
 0.4732058754737091,
 0.30508307783296046,
 0.3567530340063379,
 0.8050112948805689,
 0.23329882422520506,
 0.47519096331149147,
 0.4364357804719848,
 0.4700159919404217,
 0.7679476477883045,
 0.8320502943378436,
 0.5807564950208268,
 0.5897435897435898,
 0.38461538461538464,
 0.5716350506349809,
 0.0]

Almost all the MCC values are positive, which is good news. Let's see what the evaluation is for the whole dataset.

Matthews evaluation for the whole dataset

The MCC is a practical way to evaluate a classification model.

The program will now aggregate the predictions and true values for the whole dataset:

#@title Matthews Evaluation on the Whole Dataset
# Flatten the predictions and true values for aggregate Matthews evaluation on the whole dataset
flat_predictions = [item for sublist in predictions for item in sublist]
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]
matthews_corrcoef(flat_true_labels, flat_predictions)

The output confirms that the MCC is positive, which shows that the model's predictions correlate with the true labels on this dataset:

0.45439842471680725
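
The reason for flattening before scoring is that the MCC of the whole dataset is generally not the average of the per-batch MCC values. The following dependency-free sketch illustrates this with hypothetical labels and predictions (not the chapter's data); the helper mcc is an illustrative binary implementation, not the scikit-learn function.

```python
import math


def mcc(y_true, y_pred):
    """Binary MCC from two equal-length lists of 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: 0.0 when the denominator is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical per-batch true labels and predicted classes.
true_batches = [[1, 1, 0, 0], [1, 1, 1, 0]]
pred_batches = [[1, 1, 0, 0], [1, 1, 1, 1]]

per_batch = [mcc(t, p) for t, p in zip(true_batches, pred_batches)]
mean_of_batches = sum(per_batch) / len(per_batch)

# Flatten all batches into single lists, as the chapter's code does.
flat_true = [y for batch in true_batches for y in batch]
flat_pred = [y for batch in pred_batches for y in batch]
whole_dataset = mcc(flat_true, flat_pred)

print(per_batch)        # [1.0, 0.0]
print(mean_of_batches)  # 0.5
print(whole_dataset)    # differs from the mean of the batch scores
```

In this toy case, the second batch scores 0.0 because the model predicts a single class for every example, yet the combined dataset still shows a clearly positive correlation.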

On this final positive evaluation of the fine-tuning of the BERT model, we have an overall view of the BERT training framework.

Summary

BERT brings bidirectional attention to transformers. Predicting sequences from left to right and masking the future tokens to train a model has serious limitations. If the masked sequence contains the meaning we are looking for, the model will produce errors. BERT attends to all of the tokens of a sequence at the same time.

We explored the architecture of BERT, which only uses the encoder stack of transformers. BERT was designed as a two-step framework. The first step of the framework is to pretrain a model. The second step is to fine-tune the model. We built a fine-tuning BERT model for an Acceptability Judgement downstream task and went through every phase of the fine-tuning process. First, we loaded the dataset and the necessary pretrained modules of the model. Then the model was trained, and its performance was measured.

Fine-tuning a pretrained model takes fewer machine resources than training downstream tasks from scratch. Fine-tuned models can perform a variety of tasks. BERT proves that we can pretrain a model on two tasks only, which is remarkable in itself. But producing a multitask fine-tuned model based on the trained parameters of the BERT pretrained model is extraordinary. OpenAI GPT had worked on this approach before, but BERT took it to another level!

In this chapter, we fine-tuned a BERT model. In the next chapter, Chapter 3, Pretraining a RoBERTa Model from Scratch, we will dig deeper into the BERT framework and build a pretraining BERT-like model from scratch.

Questions

  1. BERT stands for Bidirectional Encoder Representations from Transformers. (True/False)
  2. BERT is a two-step framework. Step 1 is pretraining. Step 2 is fine-tuning. (True/False)
  3. Fine-tuning a BERT model implies training parameters from scratch. (True/False)
  4. BERT only pretrains using all downstream tasks. (True/False)
  5. BERT pretrains with Masked Language Modeling (MLM). (True/False)
  6. BERT pretrains with Next Sentence Predictions (NSP). (True/False)
  7. BERT pretrains mathematical functions. (True/False)
  8. A question-answer task is a downstream task. (True/False)
  9. A BERT pretraining model does not require tokenization. (True/False)
  10. Fine-tuning a BERT model takes less time than pretraining. (True/False)


Key benefits

  • Build and implement state-of-the-art language models, such as the original Transformer, BERT, T5, and GPT-2, using concepts that outperform classical deep learning models
  • Go through hands-on applications in Python using Google Colaboratory Notebooks with nothing to install on a local machine
  • Test transformer models on advanced use cases

Description

The transformer architecture has proved to be revolutionary in outperforming the classical RNN and CNN models in use today. With an apply-as-you-learn approach, Transformers for Natural Language Processing investigates in vast detail the deep learning for machine translations, speech-to-text, text-to-speech, language modeling, question answering, and many more NLP domains with transformers. The book takes you through NLP with Python and examines various eminent models and datasets within the transformer architecture created by pioneers such as Google, Facebook, Microsoft, OpenAI, and Hugging Face. The book trains you in three stages. The first stage introduces you to transformer architectures, starting with the original transformer, before moving on to RoBERTa, BERT, and DistilBERT models. You will discover training methods for smaller transformers that can outperform GPT-3 in some cases. In the second stage, you will apply transformers for Natural Language Understanding (NLU) and Natural Language Generation (NLG). Finally, the third stage will help you grasp advanced language understanding techniques such as optimizing social network datasets and fake news identification. By the end of this NLP book, you will understand transformers from a cognitive science perspective and be proficient in applying pretrained transformer models by tech giants to various datasets.

Who is this book for?

Since the book does not teach basic programming, you must be familiar with neural networks, Python, PyTorch, and TensorFlow in order to learn their implementation with Transformers. Readers who can benefit the most from this book include experienced deep learning & NLP practitioners and data analysts & data scientists who want to process the increasing amounts of language-driven data.

What you will learn

  • Use the latest pretrained transformer models
  • Grasp the workings of the original Transformer, GPT-2, BERT, T5, and other transformer models
  • Create language understanding Python programs using concepts that outperform classical deep learning models
  • Use a variety of NLP platforms, including Hugging Face, Trax, and AllenNLP
  • Apply Python, TensorFlow, and Keras programs to sentiment analysis, text summarization, speech recognition, machine translations, and more
  • Measure the productivity of key transformers to define their scope, potential, and limits in production

Product Details

Publication date : Jan 29, 2021
Length: 384 pages
Edition : 1st
Language : English
ISBN-13 : 9781800565791

Table of Contents

14 Chapters
Getting Started with the Model Architecture of the Transformer
Fine-Tuning BERT Models
Pretraining a RoBERTa Model from Scratch
Downstream NLP Tasks with Transformers
Machine Translation with the Transformer
Text Generation with OpenAI GPT-2 and GPT-3 Models
Applying Transformers to Legal and Financial Documents for AI Text Summarization
Matching Tokenizers and Datasets
Semantic Role Labeling with BERT-Based Transformers
Let Your Data Do the Talking: Story, Questions, and Answers
Detecting Customer Emotions to Make Predictions
Analyzing Fake News with Transformers
Other Books You May Enjoy
Index

