Overview of the Transformer architecture

Transformer models have received immense interest because of their effectiveness on an enormous range of NLP problems, from text classification to text generation. The attention mechanism is a crucial part of these models. Before Transformers, the attention mechanism was proposed as a helper for improving conventional DL models such as RNNs. To understand Transformers and their impact on NLP, we will first study the attention mechanism.

Attention mechanism

The attention mechanism allowed for the creation of a more advanced model by connecting specific tokens in the input sequence to specific tokens in the output sequence. For instance, suppose you have the keyword phrase Canadian Government in the input sentence for an English-to-Turkish translation task. In the output sentence, the Kanada Hükümeti tokens make strong connections with the input phrase and establish a weaker connection with the remaining words in the input, as illustrated in the following figure:

Figure 1.5 – Sketchy visualization of an attention mechanism

This mechanism makes models more successful in seq2seq tasks such as translation, question answering, and text summarization.

One of the first variations of the attention mechanism was proposed by Bahdanau et al. (2015). It is based on the observation that RNN-based models such as GRUs or LSTMs suffer from an information bottleneck on tasks such as neural machine translation (NMT). These encoder-decoder models take the input as token IDs and process them in a recurrent fashion (encoder). Afterward, the processed intermediate representation is fed into another recurrent unit (decoder) to produce the output. The whole input sequence is thus squeezed into a single intermediate representation (the context vector), which makes the decoder's job hard: it never sees the individual dependencies in the source and only receives this compressed summary as input.
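
To make this bottleneck concrete, here is a tiny NumPy sketch (my own illustration, not code from the book) of a vanilla RNN encoder with random stand-in weights: whatever the input length, the decoder would only receive the final hidden state as its context vector.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_emb, d_hidden = 100, 8, 16
embeddings = rng.normal(size=(vocab_size, d_emb))    # random stand-in embedding table
W_xh = rng.normal(size=(d_emb, d_hidden)) * 0.1      # input-to-hidden weights
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1   # hidden-to-hidden (recurrent) weights

token_ids = [12, 7, 55, 3]            # hypothetical source token IDs
h = np.zeros(d_hidden)                # encoder hidden state
for tid in token_ids:                 # recurrent, token-by-token processing
    h = np.tanh(embeddings[tid] @ W_xh + h @ W_hh)

context_vector = h                    # everything the decoder would see about the source
print(context_vector.shape)           # (16,) regardless of how long the input was
```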

To address this bottleneck, Bahdanau proposed an attention mechanism that places weights on the intermediate hidden values. These weights determine how much attention the model pays to each input position at each decoding step. Such guidance is especially helpful in tasks such as NMT, which is a many-to-many task. Different attention mechanisms have since been proposed with various improvements: additive, multiplicative, general, and dot-product attention all belong to this family. A modified version of dot-product attention with a scaling parameter is known as scaled dot-product attention; this specific type is the foundation of Transformer models, where it is used inside the multi-head attention mechanism. Additive attention is the variant introduced earlier by Bahdanau as a notable improvement for NMT tasks. You can see an overview of the different types of attention mechanisms here:

Name                    | Attention score function
Content-based attention | score(s, h_i) = cosine(s, h_i)
Additive                | score(s, h_i) = v^T tanh(W[s; h_i])
Location-based          | α_i = softmax(W s)
General                 | score(s, h_i) = s^T W h_i
Dot-product             | score(s, h_i) = s^T h_i
Scaled dot-product      | score(s, h_i) = s^T h_i / √n

Table 1.2 – Types of attention mechanisms

In Table 1.2, h and s represent the encoder hidden states and the decoder state, respectively, while W and v denote the weights learned by the attention mechanism; n in the scaled dot-product score is the dimensionality of the hidden states.
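
To make these score functions concrete, the following NumPy sketch (my own illustration, not code from the book) computes the additive, general, dot-product, and scaled dot-product scores between one decoder state s and a set of encoder hidden states h; the weight matrices and the vector v are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                          # hidden size
h = rng.normal(size=(5, d))    # 5 encoder hidden states
s = rng.normal(size=(d,))      # current decoder state

# Learned parameters (random stand-ins here)
W_a = rng.normal(size=(d, d))          # used by the "general" score
W_add = rng.normal(size=(d, 2 * d))    # used by the additive score
v = rng.normal(size=(d,))

dot = h @ s                            # dot-product: s^T h_i
scaled_dot = dot / np.sqrt(d)          # scaled dot-product: s^T h_i / sqrt(n)
general = h @ W_a.T @ s                # general: s^T W h_i

# Additive (Bahdanau): v^T tanh(W [s; h_i])
concat = np.concatenate([np.tile(s, (5, 1)), h], axis=1)   # [s; h_i] for each i
additive = np.tanh(concat @ W_add.T) @ v

# Attention weights are the softmax of any of these score vectors
stable = scaled_dot - scaled_dot.max()
weights = np.exp(stable) / np.exp(stable).sum()
print(weights)                          # one weight per encoder state, summing to 1
```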

Attention mechanisms are not specific to NLP; they are also used in various fields, from computer vision to speech recognition. The following figure shows a visualization of a multimodal approach trained for neural image captioning (K. Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015):

Figure 1.6 – Attention in Computer Vision

Another example is as follows:

Figure 1.7 – Another example for Attention mechanism in computer vision

Next, let’s understand multi-head attention mechanisms.

Multi-head attention mechanisms

The multi-head attention mechanism that is shown in the following diagram is an essential part of the Transformer architecture:

Figure 1.8 – Multi-head attention mechanism

Before jumping into the scaled dot-product attention mechanism, it is better to get a good understanding of self-attention. Self-attention, shown in Figure 1.8, is the basic form of the scaled attention mechanism: it takes an input matrix and produces attention scores between the items of that input.

Q in Figure 1.9 is known as the query, K as the key, and V as the value. Three matrices, shown as theta, phi, and g, are multiplied by X to produce Q, K, and V. The product of the query (Q) and the key (K) yields an attention score matrix. This can also be seen as a database lookup: the query is compared against the keys to find out, numerically, how strongly the various items are related. Multiplying the attention scores by the V matrix produces the final result of this type of attention mechanism. The main reason it is called self-attention is its unified input X: Q, K, and V are all computed from X. You can see all this depicted in the following diagram:

Figure 1.9 – Mathematical representation for the attention mechanism (diagram inspiration from https://blogs.oracle.com/datascience/multi-head-self-attention-in-nlp)
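
The following NumPy sketch (my own illustration, not the book's code) mirrors Figure 1.9: the single input X is projected by three weight matrices (the theta, phi, and g of the figure, here just random stand-ins) into Q, K, and V; the query-key product gives the attention score matrix, and multiplying its softmax by V gives the output.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))     # unified input: Q, K, and V all come from X

# The three projection matrices (theta, phi, and g in Figure 1.9), random stand-ins for learned weights
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T                            # attention score matrix, shape (seq_len, seq_len)
weights = softmax(scores)                   # each row sums to 1
output = weights @ V                        # weighted sum of the values
print(output.shape)                         # (4, 8): one context vector per input token
```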

A scaled dot-product attention mechanism is very similar to self-attention (dot-product attention) except that it uses a scaling factor. The multi-head part, on the other hand, ensures that the model can look at various aspects of the input at all levels. Transformer models attend to encoder annotations and the hidden values from past layers. The architecture of the Transformer model does not have a recurrent step-by-step flow; instead, it uses positional encoding to receive information about the position of each token in the input sequence. The token embeddings (randomly initialized at the start of training), combined with the fixed values of the positional encoding, form the input that is fed into the first encoder layer and propagated through the architecture, as illustrated in the following diagram:

Figure 1.10 – A Transformer architecture
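
As a rough sketch of both ideas (my own simplified illustration with random stand-in weights, not the book's code), scaled dot-product attention only adds a 1/√d_k factor to the scores, and multi-head attention runs several such attentions in parallel on smaller slices of the representation before concatenating and projecting the results.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # the scaling factor distinguishes this from plain dot-product
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))

# One set of projections per head (random stand-ins for learned weights)
heads = []
for _ in range(n_heads):
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_head))
    heads.append(scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v))

# Concatenate the heads and mix them with an output projection
W_o = rng.normal(size=(d_model, d_model))
multi_head_output = np.concatenate(heads, axis=-1) @ W_o
print(multi_head_output.shape)   # (4, 8)
```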

The positional information is obtained by evaluating sine and cosine waves at different frequencies. An example of positional encoding is visualized in the following figure:

Figure 1.11 – Positional encoding
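
A minimal sketch of the sinusoidal encoding used in the original Transformer paper (Vaswani et al., 2017) follows; even dimensions get sine waves and odd dimensions get cosine waves, each at a different frequency. This is an illustration rather than the book's own code.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from the original Transformer paper."""
    pos = np.arange(max_len)[:, None]                 # token positions
    i = np.arange(d_model // 2)[None, :]              # dimension pair index
    angles = pos / np.power(10000, 2 * i / d_model)   # a different frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=64)
print(pe.shape)   # (50, 64): one encoding vector per position
# These fixed vectors are combined with the token embeddings before the first encoder layer.
```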

A good example of what the scaled dot-product attention mechanism in the Transformer architecture learns is given in the following figure:

Figure 1.12 – Attention mapping for Transformers

The word it refers to different entities in different contexts. As seen in Figure 1.12, it refers to cat. If we changed angry to stale, it would refer to food. Another improvement brought by the Transformer architecture is parallelism. Conventional sequential recurrent models such as LSTMs and GRUs lack this capability because they process the input token by token. Feed-forward layers, on the other hand, are much faster because a single matrix multiplication is far cheaper than a recurrent unit. Stacks of multi-head attention layers also give the model a better understanding of complex sentences.

On the decoder side, a very similar approach to the encoder is used, with small modifications. The multi-head attention mechanism is the same, but the output of the encoder stack is also used: this encoding is given to each decoder block in its second multi-head attention layer. This small modification makes the model aware of the encoder output while decoding and, at the same time, helps it during training by providing a better gradient flow across the layers. The final softmax layer at the end of the decoder is used to produce outputs for various use cases, such as NMT, for which the original Transformer architecture was introduced.
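
A small sketch (my own illustration, with random stand-in weights) of that second decoder attention layer, often called cross-attention: the queries come from the decoder's own states, while the keys and values come from the encoder stack's output, so every decoding step can look back at the whole encoded source sequence.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, src_len, tgt_len = 8, 6, 3
encoder_output = rng.normal(size=(src_len, d_model))   # from the top of the encoder stack
decoder_states = rng.normal(size=(tgt_len, d_model))   # from the decoder's first (self-)attention layer

W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# Cross-attention: queries from the decoder, keys/values from the encoder output
Q = decoder_states @ W_q
K = encoder_output @ W_k
V = encoder_output @ W_v
scores = Q @ K.T / np.sqrt(d_model)   # (tgt_len, src_len): each target step attends over the source
context = softmax(scores) @ V
print(context.shape)                  # (3, 8)
```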

This architecture has two inputs, noted as inputs and outputs (shifted right). One (inputs) is always present, in both training and inference, while the other (outputs shifted right) is only fed in during training; at inference time, it is produced by the model itself. The reason we do not feed the model its own predictions during training is to stop it from going too wrong by itself. But what does that mean? Imagine a neural translation model trying to translate a sentence from English to French: at each step, it predicts a word and uses that predicted word to predict the next one. If it goes wrong at some step, all the subsequent predictions will be wrong, too. To keep the model from drifting like this during training, we provide the correct target words as a shifted-right version (a technique known as teacher forcing).
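
A toy illustration of the shifted-right decoder input used during training (the BOS/EOS markers and token IDs below are made up for this sketch): the model is asked to predict each target token while being fed the correct previous tokens rather than its own predictions.

```python
BOS, EOS = 1, 2                       # hypothetical special token IDs
target = [17, 42, 99, EOS]            # hypothetical "gold" French token IDs for one sentence

decoder_input = [BOS] + target[:-1]   # shifted right: the model sees the correct previous tokens
labels = target                       # the model must predict each next token

for step, (seen, to_predict) in enumerate(zip(decoder_input, labels)):
    print(f"step {step}: input token {seen} -> predict {to_predict}")
# At inference time there is no gold target: each predicted token is fed back as the next input.
```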

A visual example of a Transformer model is given in the following diagram. It shows a Transformer model with two encoder and two decoder layers. The Add & Normalize layer in this diagram adds its input (a residual connection) to the output of the Feed Forward layer and normalizes the result:

Figure 1.13 – Transformer model (inspiration from http://jalammar.github.io/illustrated-Transformer/)

Another major improvement used by Transformer-based architectures is a simple, universal text compression scheme that prevents unseen tokens on the input side. This approach, implemented through methods such as byte-pair encoding (BPE) and SentencePiece, improves a Transformer's performance in dealing with unseen tokens. It also guides the model when it encounters morphologically related tokens: a word may be rare or absent in the training data and still appear at inference time, while chunks of it have been seen in training. This is especially common in morphologically rich languages such as Turkish, German, Czech, and Latvian. For example, a model might see the word training but not trainings. In such cases, it can tokenize trainings as training + s; both of these pieces are commonly seen when considered separately.
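
For instance, a Hugging Face tokenizer makes this splitting visible; the exact pieces depend on the model's learned vocabulary, so the outputs noted in the comments are indicative rather than guaranteed.

```python
from transformers import AutoTokenizer

# WordPiece vocabulary learned for BERT; other models use BPE or SentencePiece instead
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("training"))    # likely ['training']
print(tokenizer.tokenize("trainings"))   # likely ['training', '##s']: the unseen word is split into known pieces
```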

Transformer-based models share common characteristics: they are all based on this original architecture, differing mainly in which of its steps they use or omit. In some cases, only minor changes are made, for example, improvements to the multi-head attention mechanism.

Now, we will discuss how to apply transfer learning (TL) with Transformers in the following section.
