Evolution of NLP approaches

Let us discuss how NLP has evolved over the past two decades, a period in which the field has seen significant advances. Most recently, this evolution has been dominated by the Transformer architecture. The Transformer did not emerge out of thin air; rather, it grew out of various neural NLP approaches into an attention-based encoder-decoder architecture, and it continues to evolve. In the past decade, the Transformer architecture and its variants have gained popularity due to the following developments:

  • Contextual word embeddings thanks to self-attention
  • Attention mechanisms, which overcome the problem of forcing input sentences to encode all information into one context vector
  • Better subword tokenization algorithms for handling unseen or rare words (see the short sketch after this list)
  • Injecting additional memory tokens into sentences, such as the paragraph ID in Doc2vec or a Classification ([CLS]) token in Bidirectional Encoder Representations from Transformers (BERT)
  • Parallelizable architectures that make for faster training and fine-tuning
  • Model compression (distillation, quantization, and so on)
  • TL capabilities: DL models can be easily adapted to new tasks or languages
  • Cross-lingual, multilingual, and multitasking learning capabilities
  • Multimodal training
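
As a concrete illustration of the subword tokenization point above, here is a minimal sketch, assuming the Hugging Face transformers library is installed and the bert-base-uncased checkpoint can be downloaded, showing how a rare word is broken into known subword pieces rather than mapped to an unknown token:

```python
# A minimal sketch: a WordPiece tokenizer splits an unseen or rare word into
# known subword pieces instead of mapping it to an unknown token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("playing"))        # a frequent word is likely kept whole
print(tokenizer.tokenize("untokenizable"))  # a rare word is split into pieces like ['unto', '##ken', ...]
                                            # (the exact pieces depend on the vocabulary)
```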

For several years, traditional NLP approaches such as n-gram language models, TF-IDF-based models, BM25 information retrieval models, and one-hot encoded document-term matrices were used to solve a range of NLP tasks, including sequence classification, language generation, and machine translation. However, these traditional approaches struggle with sparsity, rare words, and long-term dependencies. To address these challenges, DL-based approaches such as RNNs, CNNs, and FFNNs, along with several variants of each, were developed.
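
To make the sparsity problem concrete, here is a minimal sketch, assuming scikit-learn is available and using an illustrative toy corpus, of a TF-IDF document-term matrix whose dimensionality equals the vocabulary size:

```python
# A minimal sketch of a sparse TF-IDF document-term matrix using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document, one column per term

print(X.shape)                      # (3, vocabulary_size) – one dimension per vocabulary term
print(len(vectorizer.vocabulary_))  # number of distinct terms; grows quickly on real corpora
```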

A document vector created by the TF-IDF model would have more than 30,000 features. In 2013, word2vec, a two-layer FFNN word-encoder model, addressed this curse of dimensionality by generating compact, dense representations of words called word embeddings. This early model produced fast and efficient static word embeddings by turning unlabeled text into a supervised prediction problem (self-supervised learning): predicting a target word from its nearby neighbors. Meanwhile, GloVe, another widely used model, argued that count-based models can outperform neural ones; it leverages both the global and local statistics of a corpus to learn embeddings from word-word co-occurrence counts.
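
A minimal sketch of training such static embeddings, assuming the gensim library and an illustrative toy corpus and vector size, might look as follows:

```python
# A minimal sketch: training static word embeddings with gensim's word2vec.
# The toy corpus and vector size are illustrative only.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]

model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)  # sg=1: skip-gram

vec = model.wv["cat"]  # a dense 100-dimensional vector, identical in every context
print(vec.shape)       # (100,)
```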

Word embeddings have proven effective for certain syntactic and semantic tasks, as the following figure demonstrates: the offsets between terms in the embedding space allow for vector-oriented reasoning. For example, we can capture the gender relation, a semantic relation, from the offset between the terms Man and Woman (Man -> Woman). We can then estimate the vector of Actress arithmetically by adding this offset to the vector of Actor. Likewise, we can learn syntactic relations, such as plural forms: given the vectors of Actor, Actors, and Actress, we can estimate the vector of Actresses:

Figure 1.2 – Word embeddings offset for relationship extraction

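In code, this vector-oriented reasoning is the classic analogy query. A minimal sketch, assuming gensim's downloader and the pre-trained glove-wiki-gigaword-100 vectors can be fetched, might look like this:

```python
# A minimal sketch of analogy reasoning with pre-trained GloVe vectors via gensim.
# Assumes internet access to download the "glove-wiki-gigaword-100" vectors.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")

# actor - man + woman ≈ actress (offset arithmetic, as in the figure above)
result = glove.most_similar(positive=["actor", "woman"], negative=["man"], topn=1)
print(result)  # expected to rank "actress" (or a close synonym) near the top
```
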
The representational power of word embeddings made them a fundamental component of DL models such as RNNs and CNNs. Recurrent and convolutional architectures began to be used as encoders and decoders in sequence-to-sequence (seq2seq) problems, with each token represented by an embedding. The main challenge with these early models was polysemy (words with more than one meaning): because each word is assigned a single, fixed representation, the different senses of a word are ignored, which is especially problematic for sentence semantics.

Unlike static word embeddings, pioneering neural network models such as Universal Language Model Fine-tuning (ULMFiT) and Embeddings from Language Models (ELMo) were able to encode sentence-level information and thereby alleviate polysemy issues. This gave rise to a new concept: contextual word embeddings.
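
To see the difference from static embeddings in practice, the following minimal sketch uses a BERT checkpoint from Hugging Face as a stand-in for a contextual encoder (ELMo itself is not shown here) and extracts two different vectors for the same word in two different sentences:

```python
# A minimal sketch: the same word gets different vectors in different contexts.
# A BERT checkpoint stands in for a contextual encoder such as ELMo.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` (assumed to be a single WordPiece token)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = word_vector("i deposited cash at the bank", "bank")
v2 = word_vector("we sat on the bank of the river", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # typically well below 1.0: the two contexts differ
```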

The ULMFiT and ELMo approaches were based on Long Short-Term Memory (LSTM) networks, a variant of RNNs. They also exploited the concepts of pre-training and fine-tuning, which enable TL: a model pre-trained on a general task with large, general-purpose text corpora is then fine-tuned on a target task with supervision. This was a significant development because TL, which had previously been successful in image processing, was being applied for the first time in the field of NLP.
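
In today's tooling, this pre-train then fine-tune recipe can be sketched roughly as follows (the checkpoint name and two-label head are illustrative; fine-tuning is covered in detail in later chapters):

```python
# A minimal sketch of transfer learning in NLP: load a pre-trained encoder and
# attach a freshly initialized classification head, then fine-tune on the target task.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # encoder pre-trained on general text with self-supervision
    num_labels=2,         # new task-specific head, randomly initialized
)

# Fine-tuning would then update the weights on labeled target-task data,
# typically with far less data than pre-training required.
```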

In the meantime, the idea of an attention mechanism (Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., 2015) made a strong impression in the NLP field and achieved significant success, especially on seq2seq problems. Earlier methods compressed the entire input sequence into a single final state (known as a context vector or thought vector) and passed only that to the decoder, with no way to relate individual input and output positions. Thanks to the attention mechanism, specific parts of the input can be associated with specific parts of the output.
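
The core computation behind attention can be sketched in a few lines. The following hedged example implements scaled dot-product attention over toy tensors (Bahdanau et al. originally used an additive scoring function; the dot-product variant shown here is the simpler form later adopted by the Transformer):

```python
# A minimal sketch of scaled dot-product attention over toy tensors.
import torch
import torch.nn.functional as F

def attention(query, keys, values):
    """query: (d,), keys/values: (seq_len, d). Returns a weighted mixture of values."""
    scores = keys @ query / keys.shape[-1] ** 0.5  # one relevance score per input position
    weights = F.softmax(scores, dim=-1)            # attention distribution over the input
    return weights @ values                        # context vector focused on relevant positions

d, seq_len = 8, 5
q = torch.randn(d)               # decoder state asking "which inputs matter now?"
K = V = torch.randn(seq_len, d)  # encoder states, one per input token
print(attention(q, K, V).shape)  # torch.Size([8])
```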

In 2017, the Transformer-based encoder-decoder model was introduced and proved successful due to its innovative use of the attention mechanism. The design discards RNN recurrence entirely, relying only on attention mechanisms and feed-forward layers (Vaswani et al., Attention Is All You Need, 2017). It overcame many of the difficulties that other approaches faced and has become a new paradigm. Throughout this book, you will explore and come to understand how Transformer-based models work.
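
As a preview of what is to come, PyTorch ships a generic encoder-decoder Transformer module; a minimal sketch with random tensors standing in for embedded tokens (the dimensions are illustrative) shows the interface:

```python
# A minimal sketch of the Transformer encoder-decoder interface in PyTorch.
# Random tensors stand in for embedded source and target token sequences.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)

src = torch.randn(1, 10, 64)  # (batch, source length, embedding size)
tgt = torch.randn(1, 7, 64)   # (batch, target length, embedding size)

out = model(src, tgt)         # attention links every target position to the source positions
print(out.shape)              # torch.Size([1, 7, 64])
```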
