
Text Classification with Transformers

  • 9 min read
  • 28 Aug 2023


Introduction

This blog aims to implement binary text classification using a transformer architecture. If you're new to transformers, the "Transformer Building Blocks" blog explains the architecture and its text-generation implementation. Beyond text generation and translation, transformers are used for classification, sentiment analysis, and speech recognition. The transformer model comprises two parts: an encoder and a decoder. The encoder extracts features, while the decoder processes them. Just as a painter who has internalized the features of a tree can draw, describe, categorize, or write about one, a transformer encodes knowledge (encoder) and applies it (decoder). This two-part process is what lets transformers excel at diverse tasks such as text classification and sentiment analysis, illustrating their transformative role in NLP.

Deep Dive into Text Classification with Transformers

We train the model on the IMDB dataset. The dataset is ready to use and needs no preprocessing. The model is vocabulary-based rather than character-based so that it converges faster. I limited the vocabulary to the 20,000 most frequent words and reduced the sequence length to 200 tokens so we can train faster. To simplify the model, I use torch.nn.MultiheadAttention instead of writing the multi-head attention ourselves. This also makes the model faster, since nn.MultiheadAttention uses scaled_dot_product_attention under the hood. If you want to know how multi-head attention works, you can study the "Transformer Building Blocks" blog or see the code here.
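
As a point of reference, here is a minimal sketch of this data setup, assuming the Keras IMDB loader (which already maps words to integer indices); the conversion to PyTorch tensors and the variable names are assumptions for illustration, not the author's exact loading code:

import torch
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 20000   # keep only the 20,000 most frequent words
block_size = 200     # truncate/pad every review to 200 tokens

# Reviews arrive as lists of word indices; labels are 0 (negative) or 1 (positive).
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

# Pad or truncate each review to a fixed length so batches have a uniform shape.
x_train = pad_sequences(x_train, maxlen=block_size)
x_test = pad_sequences(x_test, maxlen=block_size)

# Convert to PyTorch tensors for the model below.
x_train = torch.tensor(x_train, dtype=torch.long)
y_train = torch.tensor(y_train, dtype=torch.float32)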

Okay, now, let us add the feature extractor part:

import torch
import torch.nn as nn

# embeds_size, num_heads, and drop_prob are hyperparameters defined elsewhere in the full code.
class transformer_block(nn.Module):
    def __init__(self):
        super(transformer_block, self).__init__()
        self.attention = nn.MultiheadAttention(embeds_size, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embeds_size, 4 * embeds_size),
            nn.LeakyReLU(),
            nn.Linear(4 * embeds_size, embeds_size),
        )
        self.drop1 = nn.Dropout(drop_prob)
        self.drop2 = nn.Dropout(drop_prob)
        self.ln1 = nn.LayerNorm(embeds_size, eps=1e-6)
        self.ln2 = nn.LayerNorm(embeds_size, eps=1e-6)

    def forward(self, hidden_state):
        # Self-attention: tokens exchange information with each other.
        attn, _ = self.attention(hidden_state, hidden_state, hidden_state, need_weights=False)
        attn = self.drop1(attn)
        out = self.ln1(hidden_state + attn)
        # Position-wise feed-forward network refines each token independently.
        observed = self.ffn(out)
        observed = self.drop2(observed)
        return self.ln2(out + observed)

●    hidden_state: A tensor of shape (batch_size, block_size, embeds_size) goes into the transformer_block, and a tensor of the same shape comes out of it (see the shape check after this list).
●    self.attention: The transformer block combines the information of the tokens so that each token is aware of its neighbors and the other tokens in the context. We may call this the communication part, and that is what nn.MultiheadAttention does. nn.MultiheadAttention is a ready-made multi-head attention layer that can be faster than implementing it from scratch, as we did in the "Transformer Building Blocks" blog.

The parameters of nn.MultiheadAttention are as follows:
     ○    embeds_size: token embedding size
     ○    num_heads: multi-head attention, as the name suggests, consists of multiple heads, and each head works on a different part of the token embeddings. Suppose your input data has shape (B, T, C) = (10, 32, 16), so the token embedding size is 16. If we set num_heads to 2 (16 is divisible by 2), multi-head attention splits the data into two parts of shape (10, 32, 8); the first head works on the first part and the second head on the second part. Projecting the data into different subspaces helps the model see different aspects of the data. Please note that the embedding size must be divisible by num_heads so that the split parts can be concatenated back together at the end.
     ○    batch_first: True means the first dimension of the input and output tensors is the batch dimension.
●    Dropout: After the attention layer, the communication between tokens is over and computations on tokens are done individually. We run dropout on the tokens. Dropout is a regularization method; regularization helps the training process rely on generalization rather than memorization. Without regularization, the model tends to memorize the training set and performs poorly on the test set. Dropout turns off features with a probability of drop_prob.
●    self.ln1: Layer normalization normalizes each embedding so that it has zero mean and unit standard deviation.
●    Residual connection (hidden_state + attn): Notice that before normalization, we add the input to the output of multi-head attention; this is called a residual connection. It has two benefits:
   ○    It lets the original embedding information pass through unchanged.
   ○    It helps prevent vanishing gradients, which are common in deep networks where we stack multiple transformer layers.
●    self.ffn: After dropout, the residual connection, and normalization, we feed the data into a simple non-linear neural network that adjusts the tokens one by one for a better representation.
●    self.ln2(out + observed): Finally, another dropout, residual connection, and layer normalization.
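
As a quick sanity check of the shapes described above, here is a minimal sketch; the hyperparameter values (embeds_size=16, num_heads=2, drop_prob=0.1) and the batch and sequence sizes are toy values chosen for illustration, not the values used in the full code:

import torch
import torch.nn as nn

embeds_size, num_heads, drop_prob = 16, 2, 0.1   # assumed toy values
B, T, C = 10, 32, embeds_size                    # batch, sequence length, embedding size

# nn.MultiheadAttention keeps the input shape: (B, T, C) in, (B, T, C) out.
attention = nn.MultiheadAttention(embeds_size, num_heads, batch_first=True)
x = torch.randn(B, T, C)
attn_out, _ = attention(x, x, x, need_weights=False)
print(attn_out.shape)     # torch.Size([10, 32, 16])

# The full transformer_block defined above also preserves the shape.
block = transformer_block()
print(block(x).shape)     # torch.Size([10, 32, 16])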

The transformer block is ready. And here is the final piece:

class transformer(nn.Module):
    def __init__(self):
        super(transformer, self).__init__()
        self.tok_embs = nn.Embedding(vocab_size, embeds_size)
        self.pos_embs = nn.Embedding(block_size, embeds_size)
        self.block = transformer_block()
        self.ln1 = nn.LayerNorm(embeds_size)
        self.ln2 = nn.LayerNorm(embeds_size)
        self.classifier_head = nn.Sequential(
            nn.Linear(embeds_size, embeds_size),
            nn.LeakyReLU(),
            nn.Dropout(drop_prob),
            nn.Linear(embeds_size, embeds_size),
            nn.LeakyReLU(),
            nn.Linear(embeds_size, num_classes),
            nn.Softmax(dim=1),
        )
        print("number of parameters: %.2fM" % (self.num_params()/1e6,))

    def num_params(self):
        n_params = sum(p.numel() for p in self.parameters())
        return n_params

    def forward(self, seq):
        B, T = seq.shape
        # Token embeddings plus positional embeddings.
        embedded = self.tok_embs(seq)
        embedded = embedded + self.pos_embs(torch.arange(T, device=device))
        # Feature extraction with the transformer block, then mean pooling over the sequence.
        output = self.block(embedded)
        output = output.mean(dim=1)
        output = self.classifier_head(output)
        return output

●    self.tok_embs: nn.Embedding is like a lookup table that receives a sequence of indices, and returns their corresponding embeddings. These embeddings will receive gradients so that the model can update them to make better predictions.
●    self.pos_embs: To comprehend a sentence, you not only need the words, you also need their order. Here, we embed the positions and add them to the token embeddings. In this way, the model has both the words and their order.
●    self.block: In this model, we only use one transformer block, but you can stack more blocks to get better results.
●    self.classifier_head: This is where we put the extracted information into action to classify the sequence. We call it the transformer head. It receives a fixed-size vector and classifies the sequence. The softmax, as the final activation function, returns a probability distribution over the classes.
●    self.tok_embs(seq): Given a sequence of indices (batch_size, block_size), it returns (batch_size, block_size, embeds_size).
●    self.pos_embs(torch.arange(T, device=device)): Given a sequence of positions, e.g. [0, 1, 2], it returns the embedding of each position. Then, we add them to the token embeddings.
●    self.block(embedded): The embedding goes to the transformer block to extract features. Given the embedded shape (batch_size, block_size, embeds_size), the output has the same shape (batch_size, block_size, embeds_size).
●    output.mean(dim=1): The mean aggregates the information from the sequence into a compact representation before feeding it into self.classifier_head. It reduces the sequence dimension and keeps the most important features of the sequence. Given the input shape (batch_size, block_size, embeds_size), the output shape is (batch_size, embeds_size): one fixed-size vector for each sequence in the batch (see the shape walkthrough after this list).
●    self.classifier_head(output): And here we classify.
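
To make the shape flow concrete, here is a minimal sketch of a forward pass on random token indices; the hyperparameter values below are arbitrary assumptions for illustration, and the real ones are defined in the full code linked below:

import torch
import torch.nn as nn

# Assumed toy hyperparameters, defined before constructing the model.
vocab_size, embeds_size, block_size = 20000, 16, 200
num_heads, drop_prob, num_classes = 2, 0.1, 2
device = "cpu"

model = transformer()                                  # prints the parameter count
seq = torch.randint(0, vocab_size, (8, block_size))    # (batch_size, block_size) of token ids
probs = model(seq)
print(probs.shape)   # torch.Size([8, 2]): one probability distribution per sequence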

The final code can be found here. The remaining code covers the supporting pieces: the training loop, loading the dataset, setting the hyperparameters, and the optimizer. I used RMSprop instead of Adam or AdamW, and BCEWithLogitsLoss instead of cross-entropy loss. BCE (binary cross-entropy) is meant for binary classification models; BCEWithLogitsLoss combines a sigmoid with binary cross-entropy and is numerically more stable. I also empirically got better accuracy with it. After 30 epochs, the final accuracy is ~84%.
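
As a rough sketch of this training setup, something like the following could be used; the learning rate, the train_loader, and the label format are assumptions for illustration, so refer to the full code linked above for the author's exact setup:

import torch
import torch.nn as nn

# device, the hyperparameters, and a DataLoader named train_loader yielding
# (token indices, labels) batches are assumed to be defined as in the full code.
model = transformer().to(device)

# BCEWithLogitsLoss combines a sigmoid with binary cross-entropy in a numerically stable way.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)   # assumed learning rate

for epoch in range(30):
    for seq, labels in train_loader:
        seq = seq.to(device)
        labels = labels.to(device).float()   # BCE expects float targets shaped like the model output
        output = model(seq)                  # (batch_size, num_classes)
        loss = criterion(output, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()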

Conclusion

This exploration of text classification with transformers shows their potential beyond text generation: they also excel at tasks such as sentiment analysis. The encoder-decoder model, analogous to a painter interpreting a tree's features, enables efficient text classification. A streamlined, practical approach and a carefully crafted transformer block keep the architecture robust. With the chosen optimizer and loss function, the model reaches an empirically validated ~84% accuracy after 30 epochs. This journey highlights transformers' impact on AI-driven language comprehension and their role in reshaping Natural Language Processing.

Author Bio

Saeed Dehqan trains language models from scratch. Currently, his work is centered on language models for text generation, and he has a strong understanding of the underlying concepts of neural networks. He is proficient in using optimizers such as genetic algorithms to fine-tune network hyperparameters and has experience with neural architecture search (NAS) using reinforcement learning (RL). He implements models end to end, from data gathering to monitoring and deployment on mobile, web, cloud, etc.