Training and evaluating LLMs

In the preceding sections, we saw how choosing an LLM architecture is a pivotal step in determining its functioning. However, the quality and diversity of the output text depend largely on two factors: the training dataset and the evaluation metric.

The training dataset determines what kind of data the LLM learns from and how well it can generalize to new domains and languages. The evaluation metric measures how well the LLM performs on specific tasks and benchmarks, and how it compares to other models and human writers. Therefore, choosing an appropriate training dataset and evaluation metric is crucial for developing and assessing LLMs.

In this section, we will discuss some of the challenges and trade-offs involved in selecting and using different training datasets and evaluation metrics for LLMs, as well as some of the recent developments and future directions in this area.

Training an LLM

By definition, LLMs are huge in two respects:

  • Number of parameters: This is a measure of the complexity of the LLM architecture and represents the number of connections among neurons. Complex architectures stack many layers, each containing multiple neurons, so between layers there is a very large number of connections, each with an associated parameter (or weight).
  • Training set: This refers to the unlabeled text corpus on which the LLM learns and trains its parameters. To give an idea of how big such a text corpus for an LLM can be, let’s consider OpenAI’s GPT-3 training set:
Figure 1.11: GPT-3 knowledge base

Considering the following assumptions:

  • 1 token ~= 4 characters in English
  • 1 token ~= ¾ words

We can conclude that GPT-3 has been trained on around 374 billion words.
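
As a back-of-the-envelope check of this arithmetic, the snippet below converts a token count into approximate words and characters using the two rules of thumb above. The corpus size of roughly 499 billion tokens is an assumption, chosen because it is consistent with the ~374-billion-word conclusion and with the commonly reported size of GPT-3's training data.

```python
# Rough conversion from tokens to words/characters using the rules of thumb
# above. The token count is an assumption consistent with the text.
TOKENS = 499_000_000_000           # assumed size of the training corpus
WORDS_PER_TOKEN = 0.75             # 1 token ~= 3/4 of a word
CHARS_PER_TOKEN = 4                # 1 token ~= 4 characters in English

approx_words = TOKENS * WORDS_PER_TOKEN
approx_chars = TOKENS * CHARS_PER_TOKEN

print(f"~{approx_words / 1e9:.0f} billion words")        # ~374 billion words
print(f"~{approx_chars / 1e9:.0f} billion characters")   # ~2 trillion characters
```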

So generally speaking, LLMs are trained using unsupervised learning on massive datasets, which often consist of billions of sentences collected from diverse sources on the internet. The transformer architecture, with its self-attention mechanism, allows the model to efficiently process long sequences of text and capture intricate dependencies between words. Training such models necessitates vast computational resources, typically employing distributed systems with multiple graphics processing units (GPUs) or tensor processing units (TPUs).

Definition

A tensor is a multi-dimensional array used in mathematics and computer science. It holds numerical data and is fundamental in fields like machine learning.

A TPU is a specialized hardware accelerator created by Google for deep learning tasks. TPUs are optimized for tensor operations, making them highly efficient for training and running neural networks. They offer fast processing while consuming less power, enabling faster model training and inference in data centers.

The training process involves numerous iterations over the dataset, adjusting the model’s parameters with an optimization algorithm (such as gradient descent) combined with backpropagation. Through this process, transformer-based language models acquire a deep understanding of language patterns, semantics, and context, enabling them to excel in a wide range of NLP tasks, from text generation to sentiment analysis and machine translation.

The following are the main steps involved in the training process of an LLM:

  1. Data collection: This is the process of gathering a large amount of text data from various sources, such as the open web, books, news articles, social media, etc. The data should be diverse, high-quality, and representative of the natural language that the LLM will encounter.
  2. Data preprocessing: This is the process of cleaning, filtering, and formatting the data for training. This may include removing duplicates, noise, or sensitive information, splitting the data into sentences or paragraphs, tokenizing the text into subwords or characters, etc.
  3. Model architecture: This is the process of designing the structure and parameters of the LLM. This may include choosing the type of neural network (such as transformer) and its structure (such as decoder only, encoder only, or encoder-decoder), the number and size of layers, the attention mechanism, the activation function, etc.
  4. Model initialization: This is the process of assigning initial values to the weights and biases of the LLM. This may be done randomly or by using pre-trained weights from another model.
  5. Model pre-training: This is the process of updating the weights and biases of the LLM by feeding it batches of data and computing the loss function. The loss function measures how well the LLM predicts the next token given the previous tokens. The LLM tries to minimize the loss by using an optimization algorithm (such as gradient descent) that, via the backpropagation mechanism, adjusts the weights and biases in the direction that reduces the loss. Training may take several epochs (iterations over the entire dataset) until the model converges to a low loss value (a minimal sketch of a single such training step is shown a little further below).

Definition

In the context of neural networks, the optimization algorithm during training is the method used to find the best set of weights for the model that minimizes the prediction error or maximizes the accuracy of the training data. The most common optimization algorithm for neural networks is stochastic gradient descent (SGD), which updates the weights in small steps based on the gradient of the error function and the current input-output pair. SGD is often combined with backpropagation, which we defined earlier in this chapter.

The output of the pre-training phase is the so-called base model.
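
To make step 5 concrete, the following is a minimal sketch of a single pre-training iteration on the next-token prediction objective, written in PyTorch. The toy model (an embedding layer followed by a linear head), the hyperparameters, and the random token batch are illustrative assumptions that stand in for a real transformer and a real text corpus.

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM: an embedding layer plus a linear output head.
# A real LLM would be a deep transformer stack instead.
vocab_size, embed_dim, batch, seq_len = 1000, 64, 8, 32

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# A random batch of token IDs stands in for a slice of the unlabeled corpus.
tokens = torch.randint(0, vocab_size, (batch, seq_len))
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict the next token

logits = model(inputs)                              # forward pass
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                     # backpropagation
optimizer.step()                                    # gradient descent update
optimizer.zero_grad()

print(f"Next-token prediction loss: {loss.item():.3f}")
```

A real pre-training run repeats this loop over billions of such batches on a distributed cluster of GPUs or TPUs until the loss converges.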

  6. Fine-tuning: The base model is trained in a supervised way on a dataset made of tuples of (prompt, ideal response). This step is necessary to make the base model behave more like an AI assistant, such as ChatGPT. The output of this phase is called the supervised fine-tuned (SFT) model (an illustrative example of such training data is shown after this list).
  7. Reinforcement learning from human feedback (RLHF): This step consists of iteratively optimizing the SFT model (by updating some of its parameters) with respect to a reward model (typically another LLM trained to incorporate human preferences).
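
To give a feel for the data involved in steps 6 and 7, the snippet below shows made-up examples of a supervised fine-tuning pair and a human-preference record used to train a reward model. The field names are illustrative assumptions, not the schema of any specific dataset.

```python
# Step 6 - supervised fine-tuning: tuples of (prompt, ideal response).
sft_example = {
    "prompt": "Summarize the plot of Romeo and Juliet in one sentence.",
    "ideal_response": (
        "Two young lovers from feuding families secretly marry, and a "
        "chain of misunderstandings ends in both of their deaths."
    ),
}

# Step 7 - RLHF: a human ranks candidate answers; such comparisons are used
# to train the reward model that then scores the SFT model's outputs.
preference_example = {
    "prompt": "Explain what an LLM is to a ten-year-old.",
    "chosen": (
        "An LLM is a computer program that has read a huge amount of text, "
        "so it can guess the next word and write answers that sound natural."
    ),
    "rejected": (
        "An LLM is a large language model based on the transformer "
        "architecture and trained via self-supervised learning."
    ),
}
```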

Definition

Reinforcement learning (RL) is a branch of machine learning that focuses on training computers to make optimal decisions by interacting with their environment. Instead of being given explicit instructions, the computer learns through trial and error: by exploring the environment and receiving rewards or penalties for its actions. The goal of reinforcement learning is to find the optimal behavior or policy that maximizes the expected reward or value of a given model. To do so, the RL process involves a reward model (RM) that is able to provide a “preferability score” to the computer. In the context of RLHF, the RM is trained to incorporate human preferences.

Note that RLHF is a pivotal milestone in aligning AI systems with human values. Given the rapid pace of achievements in the field of generative AI, it is crucial to keep endowing these powerful LLMs and, more generally, LFMs with the preferences and values that are typical of human beings.

Once we have a trained model, the next and final step is evaluating its performance.

Model evaluation

Evaluating traditional AI models was, in some ways, pretty intuitive. For example, let’s think about an image classification model that has to determine whether the input image represents a dog or a cat. We train the model on a dataset of labeled images and, once it is trained, we test it on a held-out set of images whose labels are kept aside, comparing its predictions with the true labels. The evaluation metric is simply the percentage of correctly classified images over the total number of images in the test set.
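
As a trivial illustration of that metric, here is how accuracy would be computed on a handful of made-up cat-versus-dog predictions.

```python
# Accuracy on a made-up test set: correct predictions / total predictions.
true_labels = ["cat", "dog", "dog", "cat", "dog"]
predictions = ["cat", "dog", "cat", "cat", "dog"]

correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = correct / len(true_labels)
print(f"Accuracy: {accuracy:.0%}")   # Accuracy: 80%
```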

When it comes to LLMs, the story is a bit different. As those models are trained on unlabeled text and are not task-specific, but rather generic and adaptable to a user’s prompt, traditional evaluation metrics are no longer suitable. Evaluating an LLM means, among other things, measuring its language fluency, coherence, and ability to emulate different styles depending on the user’s request.

Hence, a new set of evaluation frameworks needed to be introduced. The following are the most popular frameworks used to evaluate LLMs:

  • General Language Understanding Evaluation (GLUE) and SuperGLUE: This benchmark is used to measure the performance of LLMs on various NLU tasks, such as sentiment analysis, natural language inference, question answering, etc. The higher the score on the GLUE benchmark, the better the LLM is at generalizing across different tasks and domains.

It recently evolved into a new benchmark styled after GLUE and called SuperGLUE, which comes with more difficult tasks. It consists of eight challenging tasks that require more advanced reasoning skills than GLUE, such as natural language inference, question answering, coreference resolution, etc., a broad coverage diagnostic set that tests models on various linguistic capabilities and failure modes, and a leaderboard that ranks models based on their average score across all tasks.

The difference between the GLUE and the SuperGLUE benchmark is that the SuperGLUE benchmark is more challenging and realistic than the GLUE benchmark, as it covers more complex tasks and phenomena, requires models to handle multiple domains and formats, and has higher human performance baselines. The SuperGLUE benchmark is designed to drive research in the development of more general and robust NLU systems.

  • Massive Multitask Language Understanding (MMLU): This benchmark measures the knowledge of an LLM using zero-shot and few-shot settings.

Definition

Zero-shot evaluation is a method of evaluating a language model without any labeled data or fine-tuning. It measures how well the language model can perform a new task by using natural language instructions or examples as prompts and computing the likelihood of the correct output given the input; in other words, it is the probability that the trained model assigns to a particular set of tokens without needing any labeled training data (a sketch of zero-shot versus few-shot prompts is shown after this list of benchmarks).

This design adds complexity to the benchmark and aligns it more closely with the way we assess human performance. The benchmark comprises 14,000 multiple-choice questions categorized into 57 groups, spanning STEM, humanities, social sciences, and other fields. It covers a spectrum of difficulty levels, ranging from basic to advanced professional, assessing both general knowledge and problem-solving skills. The subjects encompass various areas, including traditional ones like mathematics and history, as well as specialized domains like law and ethics. The extensive range of subjects and depth of coverage make this benchmark valuable for uncovering any gaps in a model’s knowledge. Scoring is based on subject-specific accuracy and the average accuracy across all subjects.

  • HellaSwag: The HellaSwag evaluation framework is a method of evaluating LLMs on their ability to generate plausible and common sense continuations for given contexts. It is based on the HellaSwag dataset, which is a collection of 70,000 multiple-choice questions that cover diverse domains and genres, such as books, movies, recipes, etc. Each question consists of a context (a few sentences that describe a situation or an event) and four possible endings (one correct and three incorrect). The endings are designed to be hard to distinguish for LLMs, as they require world knowledge, common sense reasoning, and linguistic understanding.
  • TruthfulQA: This benchmark evaluates a language model’s accuracy in generating responses to questions. It includes 817 questions across 38 categories like health, law, finance, and politics. The questions are designed to mimic those that humans might answer incorrectly due to false beliefs or misunderstandings.
  • AI2 Reasoning Challenge (ARC): This benchmark is used to measure LLMs’ reasoning capabilities and to stimulate the development of models that can perform complex NLU tasks. It consists of a dataset of 7,787 multiple-choice science questions, assembled to encourage research in advanced question answering. The dataset is divided into an Easy set and a Challenge set, where the latter contains only questions that require complex reasoning or additional knowledge to answer correctly. The benchmark also provides a corpus of over 14 million science sentences that can be used as supporting evidence for the questions.
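
To illustrate the zero-shot and few-shot settings mentioned for MMLU, the following sketch builds both kinds of prompts for a made-up multiple-choice question; an evaluation harness would then check whether the answer letter the model deems most likely is the correct one. The question and formatting are illustrative assumptions, not items taken from the actual benchmark.

```python
# A made-up MMLU-style question used to contrast zero-shot and few-shot prompts.
question = (
    "Which planet in the Solar System has the largest mass?\n"
    "A. Earth\nB. Jupiter\nC. Saturn\nD. Neptune\n"
    "Answer:"
)

# Zero-shot: the question alone, with no solved examples in the prompt.
zero_shot_prompt = question

# Few-shot: one or more solved examples precede the question.
solved_example = (
    "Which gas makes up most of Earth's atmosphere?\n"
    "A. Oxygen\nB. Hydrogen\nC. Nitrogen\nD. Carbon dioxide\n"
    "Answer: C\n\n"
)
few_shot_prompt = solved_example + question

print(zero_shot_prompt)
print("---")
print(few_shot_prompt)
# An evaluation harness would feed each prompt to the LLM and score the
# likelihood of "A", "B", "C", and "D" as the continuation (here, B is correct).
```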

It is important to note that each evaluation framework focuses on a specific capability. For example, the GLUE benchmark focuses on grammar, paraphrasing, and text similarity, while MMLU focuses on generalized language understanding across various domains and tasks. Hence, while evaluating an LLM, it is important to have a clear understanding of the final goal, so that the most relevant evaluation framework can be used. Alternatively, if the goal is to have the best-of-breed model across tasks, it is key not to rely on a single evaluation framework, but rather on an average over multiple frameworks.

In addition to that, if no existing LLM is able to tackle your specific use cases, you still have room to customize those models and make them more tailored to your application scenarios. In the next section, we are going to cover the existing techniques of LLM customization, from the lightest ones (such as prompt engineering) up to the full training of an LLM from scratch.
