You're reading from Mastering Transformers The Journey from BERT to Large Language Models and Stable Diffusion

Product type Paperback

Published in Jun 2024

Publisher Packt

ISBN-13 9781837633784

Length 462 pages

Edition 2nd Edition

Authors (2):

Savaş Yıldırım

Meysam Asgari-Chenaghlu

Evolution of NLP approaches

Let us discuss how NLP has evolved over the past two decades as significant advancements have occurred in the field. Recently, the evolution has been primarily characterized by the transformative Transformer architecture. This architecture did not emerge out of thin air but, rather, evolved from various neural-based NLP approaches into an attention-based encoder-decoder architecture, which continues to evolve. In the past decade, the Transformer architecture and its variants have gained popularity due to the following developments:

Contextual word embeddings thanks to self-attention
Attention mechanisms, which overcome the problem of forcing input sentences to encode all information into one context vector
Better subword tokenization algorithms for handling unseen words or rare words
Injecting additional memory tokens into sentences, such as the paragraph ID in Doc2vec or a Classification ([CLS]) token in Bidirectional Encoder Representations from Transformers (BERT)
Parallelizable architectures that make for faster training and fine-tuning
Model compression (distillation, quantization, and so on)
Transfer learning (TL) capabilities: deep learning (DL) models can be easily adapted to new tasks or languages
Cross-lingual, multilingual, and multitask learning capabilities
Multimodal training

For many years, traditional NLP approaches such as n-gram language models, TF-IDF-based models, BM25 information retrieval models, and one-hot encoded document-term matrices were used to solve a range of NLP tasks, including sequence classification, language generation, and machine translation. However, these traditional approaches have limitations, such as difficulty in handling sparsity, rare words, and long-term dependencies. To address these challenges, researchers developed DL-based approaches such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and feed-forward neural networks (FFNNs), as well as several variants.
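To make the sparsity problem concrete, here is a minimal sketch of TF-IDF document vectors in plain Python. The toy corpus, the function name `tfidf_vector`, and the unsmoothed IDF formula are assumptions chosen for illustration; real libraries use smoothed variants and sparse storage.

```python
import math
from collections import Counter

# Toy corpus (hypothetical example data, not taken from the book).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})


def tfidf_vector(doc_tokens, tokenized_docs, vocab):
    """Return one TF-IDF weight per vocabulary term for a single document."""
    counts = Counter(doc_tokens)
    n_docs = len(tokenized_docs)
    vector = []
    for term in vocab:
        tf = counts[term] / len(doc_tokens)                # term frequency
        df = sum(1 for d in tokenized_docs if term in d)   # document frequency
        idf = math.log(n_docs / df)                        # plain, unsmoothed IDF
        vector.append(tf * idf)
    return vector


vectors = [tfidf_vector(doc, tokenized, vocab) for doc in tokenized]

# Every vector is as long as the vocabulary, and most entries are zero.
zeros = sum(w == 0.0 for v in vectors for w in v)
sparsity = zeros / (len(vectors) * len(vocab))
print(f"vector length = {len(vocab)}, share of zero entries = {sparsity:.2f}")
```

Each vector has exactly one entry per vocabulary term, so the vector length grows with the vocabulary while the number of non-zero entries per document stays small.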

A document vector created by a TF-IDF model has one feature per vocabulary term, so it can easily exceed 30,000 features. In 2013, word2vec, a two-layer FFNN word-encoder model, addressed this curse of dimensionality by generating compact, dense representations of words called word embeddings. This early model created static word embeddings quickly and efficiently by turning unlabeled text into supervised training data (self-supervised learning): it predicts a target word from its neighboring words. Meanwhile, GloVe, another widely used model, argued that count-based models can be better than neural models. It leverages both global and local statistics of a corpus to learn embeddings from word-word co-occurrence counts.
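The self-supervised step can be sketched as follows: raw text is turned into (context, target) training pairs without any manual labels. The function name `context_target_pairs` and the window size are hypothetical choices for illustration; this shows only the data preparation, not the training of the network itself.

```python
def context_target_pairs(tokens, window=2):
    """Build (context words, target word) pairs from unlabeled text.

    The target word acts as the "label", so no manual annotation is needed.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


sentence = "the quick brown fox jumps".split()
pairs = context_target_pairs(sentence, window=1)
for context, target in pairs:
    print(context, "->", target)
```

A model trained on such pairs learns to predict the target word from its neighbors, and the learned weights serve as the word embeddings.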

Word embeddings have proven effective for some syntactic and semantic tasks, as demonstrated in the following figure, which illustrates how the offsets between terms in the embeddings allow for vector-oriented reasoning. For example, we can discern the generalization of gender relations, a semantic relation, from the offset between the terms Man and Woman (Man -> Woman). Then, we can arithmetically estimate the vector of Actress by adding the vector of the term Actor and the offset calculated before. Likewise, we can learn the syntactic relationships, such as word plural forms. For instance, if the vectors of Actor, Actors, and Actress are given, we can estimate the vector of Actresses:

Figure 1.2 – Word embeddings offset for relationship extraction
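The offset arithmetic described above can be sketched with hand-made toy vectors. The three dimensions and all values below are assumptions constructed so that the offsets are exact; embeddings learned from real data satisfy such relations only approximately.

```python
import math

# Hand-made 3-D toy vectors (hypothetical values, not learned embeddings).
# Axis 0 ~ gender, axis 1 ~ singular/plural, axis 2 ~ "acting profession".
emb = {
    "man":       [-1.0, -1.0, 0.0],
    "woman":     [1.0, -1.0, 0.0],
    "actor":     [-1.0, -1.0, 1.0],
    "actress":   [1.0, -1.0, 1.0],
    "actors":    [-1.0, 1.0, 1.0],
    "actresses": [1.0, 1.0, 1.0],
}


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def nearest(vec, exclude=()):
    """Return the vocabulary word whose vector is closest to `vec`."""
    candidates = [w for w in emb if w not in exclude]
    return min(candidates, key=lambda w: math.dist(vec, emb[w]))


# Semantic relation: Actor + (Woman - Man) should land near Actress.
gender_offset = sub(emb["woman"], emb["man"])
est_actress = add(emb["actor"], gender_offset)
print(nearest(est_actress, exclude=("actor", "man", "woman")))

# Syntactic relation: Actress + (Actors - Actor) should land near Actresses.
plural_offset = sub(emb["actors"], emb["actor"])
est_actresses = add(emb["actress"], plural_offset)
print(nearest(est_actresses, exclude=("actress", "actor", "actors")))
```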

The representational power of word embeddings made them a fundamental component of DL models such as RNNs and CNNs. Recurrent and convolutional architectures started to be used as encoders and decoders in sequence-to-sequence (seq2seq) problems, where each token is represented by an embedding. The main challenge with these early models was polysemous words (words with more than one meaning). Because a single fixed representation is assigned to each word, its different senses are ignored, which is an especially severe problem for sentence semantics.
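The polysemy problem is easy to see in a small sketch of a static embedding lookup. The table `static_embeddings` and its values are made up for illustration.

```python
# A static embedding table: exactly one vector per word (hypothetical values).
static_embeddings = {
    "river": [0.9, 0.1],
    "money": [0.1, 0.9],
    "bank":  [0.5, 0.5],
}


def embed(sentence):
    """Look up each token independently of its context."""
    return [static_embeddings[token] for token in sentence.split()]


riverside = embed("river bank")
finance = embed("money bank")

# "bank" receives the same vector in both sentences, so the two senses
# (riverside versus financial institution) cannot be told apart.
print(riverside[1] == finance[1])
```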

Unlike static word embeddings, pioneering neural network models such as Universal Language Model Fine-tuning (ULMFiT) and Embeddings from Language Models (ELMo) were able to encode sentence-level information and alleviate polysemy issues. This gave rise to a new concept: contextual word embeddings.

The ULMFiT and ELMo approaches were based on Long Short-Term Memory (LSTM) networks, a variant of RNN. They also exploited the concept of pre-training and fine-tuning, which enables TL: a model is first pre-trained on a general task with general textual datasets and then fine-tuned on a target task with supervision. This was a significant development because TL, which had previously been successful in image processing, was being applied for the first time in the field of NLP.

In the meantime, the idea of an attention mechanism (Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, 2015) made a strong impression in the NLP field and achieved significant success, especially in seq2seq problems. Earlier methods would pass only the last state (known as a context vector or thought vector) obtained from the entire input sequence to the decoder, without linking individual input tokens to output tokens or filtering out irrelevant ones. Thanks to the attention mechanism, certain parts of the input can be associated with certain parts of the output.
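The difference between a single fixed context vector and an attention-based one can be sketched as follows. This is a simplified illustration under stated assumptions: it scores encoder states with a plain dot product rather than the additive scoring network used by Bahdanau et al., and the state values and the name `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Weight every encoder state by its relevance to the current decoder state."""
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]          # dot-product scoring (an assumption)
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]                 # weighted sum of all states
    return weights, context


# Toy encoder states, one per input token (hypothetical values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

# Without attention: the decoder only ever sees the last encoder state.
fixed_context = encoder_states[-1]

# With attention: a fresh context vector is computed for each decoding step.
weights, context = attend(decoder_state, encoder_states)
print("attention weights:", [round(w, 3) for w in weights])
print("context vector:", [round(c, 3) for c in context])
```

The weights show which input positions contribute most to the current output step, which is the link between input and output parts that a single last-state vector cannot express.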

In 2017, the Transformer-based encoder-decoder model was introduced and proved successful due to its innovative use of the attention mechanism. The design discards RNN recurrence and relies only on attention mechanisms combined with feed-forward layers (Vaswani et al., Attention Is All You Need, 2017). It overcame many difficulties that other approaches faced and has become a new paradigm. Throughout this book, you will be exploring and understanding how Transformer-based models work.
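As a rough sketch of attention without recurrence, the following computes scaled dot-product self-attention, softmax(QK^T / sqrt(d_k)) V, in plain Python. To keep it short, it assumes the queries, keys, and values are all the raw input vectors; an actual Transformer layer applies learned projection matrices and multiple heads, which are omitted here, and the input values are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (a simplification)."""
    d_k = len(x[0])
    outputs = []
    # Each position depends only on the inputs, not on previous outputs,
    # so all positions could be computed in parallel (no recurrence).
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
                  for key in x]
        weights = softmax(scores)
        outputs.append([sum(w * value[i] for w, value in zip(weights, x))
                        for i in range(d_k)])
    return outputs


# Toy token embeddings for a three-token sequence (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens)
for row in contextual:
    print([round(v, 3) for v in row])
```

Each output row is a mixture of all input vectors, so every token's representation now depends on its context.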
