Recalling traditional NLP approaches

Although traditional models are rapidly being superseded, they can still shed light on innovative designs. The most important of these is the distributional approach, which is still in use. Distributional semantics is a theory that explains the meaning of a word through its distributional evidence rather than through predefined dictionary definitions or other static resources. This approach suggests that words that frequently occur in similar contexts tend to have similar meanings. For example, words such as dog and cat often occur in similar contexts, suggesting that they may have related meanings. The idea was first proposed by Zellig S. Harris in his 1954 work, Distributional Structure. One of the benefits of using a distributional approach is that it allows researchers to track how the meaning of a word evolves over time or varies across domains and senses, which is not possible using dictionary definitions alone.

For many years, traditional approaches to NLP have relied on Bag-of-Words (BoW) and n-gram language models to understand words and sentences. BoW approaches, also known as vector space models (VSMs), represent words and documents using one-hot encoding, a sparse representation method. These one-hot encoding techniques have been used to solve a variety of NLP tasks, such as text classification, word similarity, semantic relation extraction, and word-sense disambiguation. On the other hand, n-gram language models assign probabilities to sequences of words, which can be used to calculate the likelihood that a given sequence belongs to a corpus or to generate random sequences based on a given corpus.
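As a concrete illustration of the sparse count-based representation described above, here is a minimal sketch using scikit-learn's CountVectorizer (the library choice and the two-sentence toy corpus are assumptions for illustration, not the book's own example):

```python
# Minimal Bag-of-Words (vector space) sketch: a toy corpus becomes a sparse
# document-term count matrix; corpus and variable names are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the dog chased the cat",
    "the cat sat on the mat",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # vocabulary terms (the columns)
print(X.toarray())                          # one row per document; mostly zeros
                                            # as the vocabulary grows
```

Each document is mapped to a vector whose length equals the vocabulary size, which is why these representations become very sparse on realistic corpora.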

Along with these models, the Term Frequency (TF) and Inverse Document Frequency (IDF) metrics are often used to assign importance to a term. IDF helps to reduce the weight of high-frequency words, such as stop words and functional words, which have little discriminatory power in understanding the content of a document. The discriminatory power of a term also depends on the domain – for example, a collection of articles about DL is likely to contain the word network in almost every document, so seeing it there carries little information. The Document Frequency (DF) of a word is the number of documents in which it appears; its inverse, the IDF, is used to scale down the weights of such common terms. The TF is simply the raw count of a term in a document.
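To make the weighting concrete, here is a toy, from-scratch TF-IDF computation (a sketch only; real libraries apply smoothing and normalization variants that differ from this plain log(N/df) formulation, and the corpus is made up for illustration):

```python
# Toy TF-IDF: the domain-frequent word "network" gets IDF = 0, while rarer
# terms keep a higher weight.
import math

docs = [
    ["network", "training", "loss"],
    ["network", "layers", "dropout"],
    ["network", "embedding", "tokens"],
]
N = len(docs)

def tf(term, doc):
    return doc.count(term)                       # raw count of the term in the document

def idf(term):
    df = sum(1 for doc in docs if term in doc)   # document frequency
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("network", docs[0]))   # 0.0 (appears in every document)
print(tf_idf("dropout", docs[1]))   # ~1.10 (appears in only one document)
```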

Some of the advantages and disadvantages of a TF-IDF-based BoW model are listed as follows:

Advantages:

  • Easy to implement
  • Human-interpretable results
  • Domain adaptation

Disadvantages:

  • Dimensionality curse
  • No solution for unseen words
  • Hardly captures semantic relations (is-a, has-a, synonym)
  • Ignores word order
  • Slow for large vocabularies

Table 1.1 – Advantages and disadvantages of a TF-IDF BoW model

Using a BoW approach to represent even a short sentence can be impractical because each word in the dictionary becomes a separate dimension of the vector, regardless of whether it appears in the sentence. The result is a high-dimensional vector with many zero cells, which is difficult to work with and requires a large amount of memory to store.

Latent semantic analysis (LSA) has been widely used to overcome the dimensionality problem of the BoW model. It is a linear method that captures pairwise correlations between terms. LSA and its probabilistic variants can still be viewed as having a single layer of hidden topic variables, whereas current DL models stack multiple hidden layers and contain billions of parameters. Moreover, Transformer-based models have shown that they discover latent representations far better than such traditional models.
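A minimal LSA sketch, assuming scikit-learn is available: TF-IDF vectors are projected onto a handful of latent "topic" dimensions with truncated SVD, yielding dense, low-dimensional document vectors (the corpus and the choice of two components are illustrative only):

```python
# LSA sketch: TF-IDF followed by truncated SVD reduces sparse, high-dimensional
# document vectors to a few dense latent dimensions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

corpus = [
    "dogs and cats are common pets",
    "cats chase mice around the house",
    "neural networks learn word representations",
    "transformers build contextual representations",
]

lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
doc_topics = lsa.fit_transform(corpus)

print(doc_topics.shape)   # (4, 2): each document as a dense 2-dimensional vector
```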

The traditional pipeline for NLP tasks begins with a series of preparation steps, such as tokenization, stemming, noun phrase detection, chunking, and stop-word elimination. After these steps are completed, a document-term matrix is constructed using a weighting schema, with TF-IDF being the most popular choice. This matrix can then be used as input for various machine learning (ML) pipelines, including sentiment analysis, document similarity, document clustering, and measuring the relevance of a document to a query. Similarly, terms can be represented as a matrix and used as input for token classification tasks, such as named entity recognition and semantic relation extraction. The classification phase typically involves the application of supervised ML algorithms, such as support vector machines (SVMs), random forests, logistic regression, naive Bayes, and ensemble learners (boosting or bagging).
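As a hedged sketch of such a pipeline (assuming scikit-learn; the tiny sentiment dataset, labels, and classifier choice are illustrative, not the book's own example), TF-IDF features can feed a linear SVM for text classification:

```python
# Traditional text-classification pipeline sketch: TF-IDF features + linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "the movie was wonderful and moving",
    "a brilliant, heartfelt performance",
    "the plot was dull and predictable",
    "a boring, poorly acted film",
]
train_labels = ["positive", "positive", "negative", "negative"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)

print(clf.predict(["a dull but heartfelt movie"]))   # predicted sentiment label
```

Swapping LinearSVC for logistic regression, naive Bayes, or a random forest changes only the last stage; the feature extraction stays the same.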

Language modeling and generation

Traditional approaches to language generation tasks often rely on n-gram language models, also known as Markov processes. These are stochastic models that estimate the probability of a word (event) based on a subset of the previous words. There are three main types: unigram, bigram, and general n-gram models. Let us look at these in more detail:

  • Unigram models assume that all words are independent and do not form a chain. The probability of a word is simply calculated as its frequency divided by the total word count.
  • Bigram models, also known as first-order Markov processes, estimate the probability of a word based on the previous word. This probability is calculated as the ratio of the joint probability of two consecutive words to the probability of the first word (a toy estimation sketch follows this list).
  • N-gram models, also known as (n-1)-order Markov processes, estimate the probability of a word based on the previous n-1 words.
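The following toy maximum-likelihood bigram model illustrates the estimation described above (a sketch only, with no smoothing for unseen bigrams; the corpus is made up for illustration):

```python
# Toy bigram (first-order Markov) language model estimated by counting
# consecutive word pairs: P(word | prev) = count(prev, word) / count(prev).
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    return bigrams[(prev_word, word)] / unigrams[prev_word]

print(bigram_prob("the", "cat"))   # 0.25 ("the" occurs 4 times, "the cat" once)
print(bigram_prob("sat", "on"))    # 1.0  (every "sat" is followed by "on")
```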

Having briefly reviewed the paradigms underlying traditional NLP models in this Recalling traditional NLP approaches subsection, we will now move on to discuss how neural language models have impacted the field of NLP and how they have addressed the limitations of traditional models.
