Recalling traditional NLP approaches
Although traditional models are rapidly being superseded, they can still shed light on innovative designs. The most important of these is the distributional method, which remains in use. Distributional semantics is a theory that explains the meaning of a word by analyzing its distributional evidence rather than relying on predefined dictionary definitions or other static resources. This approach suggests that words that frequently occur in similar contexts tend to have similar meanings. For example, words such as dog and cat often occur in similar contexts, suggesting that they may have related meanings. The idea was first proposed by Zellig S. Harris in his 1954 work, Distributional Structure. One of the benefits of using a distributional approach is that it allows researchers to track the semantic evolution of words over time, or across different domains or senses, which is a task that is not possible using dictionary definitions alone.
For many years, traditional approaches to NLP have relied on Bag-of-Words (BoW) and n-gram language models to understand words and sentences. BoW approaches, also known as vector space models (VSMs), represent words and documents using one-hot encoding, a sparse representation method. These one-hot encoding techniques have been used to solve a variety of NLP tasks, such as text classification, word similarity, semantic relation extraction, and word-sense disambiguation. On the other hand, n-gram language models assign probabilities to sequences of words, which can be used to calculate the likelihood that a given sequence belongs to a corpus or to generate random sequences based on a given corpus.
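The BoW idea can be sketched in a few lines of plain Python. The toy corpus and helper names below are illustrative, not from a specific library; each document becomes a count vector over a fixed vocabulary:

```python
# Minimal Bag-of-Words sketch: each document becomes a vector of term
# counts over a fixed vocabulary (toy corpus; names are illustrative).
docs = [
    "the dog chased the cat",
    "the cat sat on the mat",
]

# Build the vocabulary from the whole corpus
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Count occurrences of each vocabulary term in the document."""
    counts = {}
    for word in doc.split():
        counts[word] = counts.get(word, 0) + 1
    return [counts.get(term, 0) for term in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)    # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 1, 1, 0, 0, 0, 2], [1, 0, 0, 1, 1, 1, 2]]
```

Note that each vector has one dimension per vocabulary word, so most entries are zero for any single document; this is the sparsity problem discussed later in this section.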
Along with these models, the Term Frequency (TF) and the Inverse Document Frequency (IDF) metrics are often used to assign importance to a term. The TF is simply the raw count of a term in a document, while the Document Frequency (DF) of a term is the number of documents in which it appears. The IDF, typically computed as the logarithm of the total number of documents divided by the DF, reduces the weight of high-frequency words, such as stop words and function words, which have little discriminatory power in understanding the content of a document. The discriminatory power of a term also depends on the domain – for example, in a collection of articles about DL, the word network is likely to appear in almost every document and therefore carries little information there.
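A minimal sketch of this weighting, using the common tf * log(N/df) variant (libraries such as scikit-learn apply smoothed variants of the same formula; the documents here are a toy example):

```python
import math

# TF-IDF sketch: tf = raw count; idf = log(N / df), one common variant.
# The corpus below is a toy example with pre-tokenized documents.
docs = [
    ["network", "training", "loss"],
    ["network", "layers", "network"],
    ["data", "training", "network"],
]
N = len(docs)

def df(term):
    """Number of documents containing the term."""
    return sum(term in doc for doc in docs)

def tf_idf(term, doc):
    tf = doc.count(term)            # raw term frequency
    idf = math.log(N / df(term))    # down-weights ubiquitous terms
    return tf * idf

# "network" appears in every document, so its IDF -- and weight -- is 0
print(tf_idf("network", docs[1]))   # 2 * log(3/3) = 0.0
print(tf_idf("layers", docs[1]))    # 1 * log(3/1) ≈ 1.0986
```

This reproduces the domain effect described above: a term that occurs in every document of a collection, like network in a DL corpus, ends up with zero weight.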
Some of the advantages and disadvantages of a TF-IDF-based BoW model are listed as follows:
| Advantages | Disadvantages |
| --- | --- |
| Simple and easy to implement | Produces very high-dimensional, sparse vectors |
| Interpretable: each dimension corresponds to a term | Ignores word order and context |
| Requires no training phase | Cannot capture semantic similarity between related words such as dog and cat |
Table 1.1 – Advantages and disadvantages of a TF-IDF BoW model
Using a BoW approach to represent a small sentence can be impractical because it involves representing each word in the dictionary as a separate dimension in a vector, regardless of whether it appears in the sentence or not. This can result in a high-dimensional vector with many zero cells, making it difficult to work with and requiring a large amount of memory to store.
Latent semantic analysis (LSA) has been widely used to overcome the dimensionality problem of the BoW model. It is a linear method that captures pairwise correlations between terms. LSA-based probabilistic methods can still be viewed as having a single hidden layer of topic variables, whereas current DL models include multiple hidden layers, with billions of parameters. Moreover, Transformer-based models have shown that they can discover latent representations much better than such traditional models.
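The core of LSA is a truncated singular value decomposition (SVD) of the term-document matrix. The following sketch, with an illustrative toy matrix, shows how keeping only the top-k singular vectors maps documents into a low-dimensional latent "topic" space:

```python
import numpy as np

# LSA sketch: factor a toy term-document count matrix with SVD and keep
# the top-k singular vectors as latent "topic" dimensions.
# Rows = terms, columns = documents (counts are illustrative).
A = np.array([
    [2, 1, 0, 0],   # "network"  (docs 0-1 are about DL)
    [1, 2, 0, 0],   # "training"
    [0, 0, 3, 1],   # "recipe"   (docs 2-3 are about cooking)
    [0, 0, 1, 3],   # "flour"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                           # latent topics to keep
doc_embeddings = (np.diag(s[:k]) @ Vt[:k]).T    # documents in topic space

# Documents about the same subject land close together in the k-dim space
print(doc_embeddings.round(2))
```

After truncation, the two DL documents end up much closer to each other than to either cooking document, even though their raw count vectors share no dimensions with nonzero values in common beyond their own block.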
The traditional pipeline for NLP tasks begins with a series of preparation steps, such as tokenization, stemming, noun phrase detection, chunking, and stop-word elimination. After these steps are completed, a document-term matrix is constructed using a weighting schema, with TF-IDF being the most popular choice. This matrix can then be used as input for various machine learning (ML) pipelines, including sentiment analysis, document similarity, document clustering, and measuring the relevancy score between a query and a document. Similarly, terms can be represented as a matrix and used as input for token classification tasks, such as named entity recognition and semantic relation extraction. The classification phase typically involves the application of supervised ML algorithms, such as support vector machines (SVMs), random forests, logistic regression, naive Bayes, and ensemble methods (boosting or bagging).
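The stages of this pipeline can be sketched end to end in plain Python. The corpus, stop-word list, and query below are illustrative; the pipeline tokenizes, removes stop words, builds a TF-IDF document-term matrix, and scores query-document relevancy with cosine similarity:

```python
import math

# Sketch of the classical pipeline: tokenize -> remove stop words ->
# TF-IDF document-term matrix -> cosine similarity for relevancy scoring.
STOP_WORDS = {"the", "a", "is", "of", "on", "for"}

def preprocess(text):
    """Tokenize by whitespace and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

corpus = [
    "the network is trained on the data",
    "the recipe calls for flour",
    "deep network layers learn features",
]
docs = [preprocess(d) for d in corpus]
N = len(docs)
vocab = sorted({t for d in docs for t in d})

def tfidf_vector(tokens):
    vec = []
    for term in vocab:
        tf = tokens.count(term)
        dfreq = sum(term in d for d in docs)
        idf = math.log(N / dfreq) if dfreq else 0.0
        vec.append(tf * idf)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

matrix = [tfidf_vector(d) for d in docs]
query = tfidf_vector(preprocess("network layers"))
scores = [cosine(query, row) for row in matrix]
print(scores)  # the two network documents outrank the recipe one
```

In practice, each stage would be replaced by a library component (for example, a proper stemmer and a learned classifier instead of raw cosine ranking), but the data flow is the same.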
Language modeling and generation
Traditional approaches to language generation tasks often rely on n-gram language models, also known as Markov processes. These are stochastic models that estimate the probability of a word (event) based on a subset of previous words. There are three main types of n-gram models: unigram, bigram, and n-gram (generalized). Let us look at these in more detail:
- Unigram models assume that all words are independent and do not form a chain. The probability of a word in the vocabulary is simply estimated by dividing its count by the total number of words in the corpus.
- Bigram models, also known as first-order Markov processes, estimate the probability of a word based on the previous word. This probability is calculated by the ratio of the joint probability of two consecutive words to the probability of the first word.
- N-gram models, also known as (n-1)th-order Markov processes, estimate the probability of a word based on the previous n-1 words.
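The bigram case above can be sketched with raw counts on a toy corpus (maximum-likelihood estimation, no smoothing; the corpus is illustrative):

```python
from collections import Counter

# Bigram language model sketch: estimate P(w2 | w1) from raw counts
# on a toy corpus (maximum-likelihood estimate, no smoothing).
corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    """P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("the", "cat"))  # 2 of 3 occurrences of "the" precede "cat"
```

This is exactly the ratio described above: the joint count of two consecutive words divided by the count of the first word. Real systems add smoothing (for example, Laplace or Kneser-Ney) so that unseen bigrams do not receive zero probability.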
We have now reviewed the paradigms underlying traditional NLP models in this Recalling traditional NLP approaches section. Next, we will discuss how neural language models have impacted the field of NLP and how they address the limitations of traditional models.