Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Pretrain Vision and Large Language Models in Python
Pretrain Vision and Large Language Models in Python

Pretrain Vision and Large Language Models in Python: End-to-end techniques for building and deploying foundation models on AWS

Arrow left icon
Profile Icon Emily Webber
Arrow right icon
$19.99 per month
Full star icon Full star icon Full star icon Full star icon Half star icon 4.4 (22 Ratings)
Paperback May 2023 258 pages 1st Edition
eBook
$9.99 $39.99
Paperback
$49.99
Subscription
Free Trial
Renews at $19.99p/m
Arrow left icon
Profile Icon Emily Webber
Arrow right icon
$19.99 per month
Full star icon Full star icon Full star icon Full star icon Half star icon 4.4 (22 Ratings)
Paperback May 2023 258 pages 1st Edition
eBook
$9.99 $39.99
Paperback
$49.99
Subscription
Free Trial
Renews at $19.99p/m
eBook
$9.99 $39.99
Paperback
$49.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing
Table of content icon View table of contents Preview book icon Preview Book

Pretrain Vision and Large Language Models in Python

An Introduction to Pretraining Foundation Models

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin … The only thing that matters in the long run is the leveraging of computation.

– Richard Sutton, “The Bitter Lesson,” 2019 (1)

In this chapter, you’ll be introduced to foundation models, the backbone of many artificial intelligence and machine learning systems today. In particular, we will dive into their creation process, also called pretraining, and understand where it’s competitive to improve the accuracy of your models. We will discuss the core transformer architecture underpinning state-of-the-art models such as Stable Diffusion, BERT, Vision Transformers, OpenChatKit, CLIP, Flan-T5, and more. You will learn about the encoder and decoder frameworks, which work to solve a variety of use cases.

In this chapter, we will cover the following topics:

  • The art of pretraining and fine-tuning
  • The Transformer model architecture
  • State-of-the-art vision and language models
  • Encoders and decoders

The art of pretraining and fine-tuning

Humanity is one of Earth’s most interesting creatures. We are capable of producing the greatest of beauty and asking the most profound questions, and yet fundamental aspects about us are, in many cases, largely unknown. What exactly is consciousness? What is the human mind, and where does it reside? What does it mean to be human, and how do humans learn?

While scientists, artists, and thinkers from countless disciplines grapple with these complex questions, the field of computation marches forward to replicate (and in some cases, surpass) human intelligence. Today, applications from self-driving cars to writing screenplays, search engines, and question-answering systems have one thing in common – they all use a model, and sometimes many different kinds of models. Where do these models come from, how do they acquire intelligence, and what steps can we take to apply them for maximum impact? Foundation models are essentially compact representations of massive sets of data. The representation comes about through applying a pretraining objective onto the dataset, from predicting masked tokens to completing sentences. Foundation models are useful because once they have been created, through the process called pretraining, they can either be deployed directly or fine-tuned for a downstream task. An example of a foundation model deployed directly is Stable Diffusion, which was pretrained on billions of image-text pairs and generates useful images from text immediately after pretraining. An example of a fine-tuned foundation model is BERT, which was pretrained on large language datasets, but is most useful when adapted for a downstream domain, such as classification.

When applied in natural language processing, these models can complete sentences, classify text into different categories, produce summarizations, answer questions, do basic math, and generate creative artifacts such as poems and titles. In computer vision, foundation models are useful everywhere from image classification to generation, pose estimation to object detection, pixel mapping, and more.

This comes about because of defining a pretraining objective, which we’ll learn about in detail in this book. We’ll also cover its peer method, fine-tuning, which helps the model learn more about a specific domain. This more generally falls under the category of transfer learning, the practice of taking a pretrained neural network and supplying it with a novel dataset with the hope of enhancing its knowledge in a certain dimension. In both vision and language, these terms have some overlap and some clear distinctions, but don’t worry, we’ll cover them more throughout the chapters. I’m using the term fine-tuning to include the whole set of techniques to adapt a model to another domain, outside of the one where it was trained, not in the narrow, classic sense of the term.

Fundamentals – pretraining objectives

The heart of large-scale pretraining revolves around this core concept. A pretraining objective is a method that leverages information readily available in the dataset without requiring extensive human labeling. Some pretraining objectives involve masking, providing a unique [MASK] token in place of certain words, and training the model to fill in those words. Others take a different route, using the left-hand side of a given text string to attempt to generate the right-hand side.

The training process happens through a forward pass, sending your raw training data through the neural network to produce some output word. The loss function then computes the difference between this predicted word and the one found in the data. This difference between the predicted values and the actual values then serves as the basis for the backward pass. The backward pass itself usually leverages a type of stochastic gradient descent to update the parameters of the neural network with respect to that same loss function, ensuring that, next time around, it’s more likely to get a lower loss function.

In the case of BERT(2), the pretraining objective is called a masked token loss. For generative textual models of the GPT (3) variety, the pretraining objective is called causal language loss. Another way of thinking about this entire process is self-supervised learning, utilizing content already available in a dataset to serve as a signal to the model. In computer vision, you’ll also see this referred to as a pretext task. More on state-of-the-art models in the sections ahead!

Personally, I think pretraining is one of the most exciting developments in machine learning research. Why? Because, as Richard Sutton suggests controversially at the start of the chapter, it’s computationally efficient. Using pretraining, you can build a model from massive troves of information available on the internet, then combine all of this knowledge using your own proprietary data and apply it to as many applications as you can dream of. On top of that, pretraining opens the door for tremendous collaboration across company, country, language, and domain lines. The industry is truly just getting started in developing, perfecting, and exploiting the pretraining paradigm.

We know that pretraining is interesting and effective, but where is it competitive in its own right? Pretraining your own model is useful when your own proprietary dataset is very large and different from common research datasets, and primarily unlabeled. Most of the models we will learn about in this book are trained on similar corpora – Wikipedia, social media, books, and popular internet sites. Many of them focus on the English language, and few of them consciously use the rich interaction between visual and textual data. Throughout the book, we will learn about the nuances and different advantages of selecting and perfecting your pretraining strategies.

If your business or research hypothesis hinges on non-standard natural languages, such as financial or legal terminology, non-English languages, or rich knowledge from another domain, you may want to consider pretraining your own model from scratch. The core question you want to ask yourself is, How valuable is an extra one percentage point of accuracy in my model? If you do not know the answer to this question, then I strongly recommend spending some time getting yourself to an answer. We will spend time discussing how to do this in Chapter 2. Once you can confidently say an increase in the accuracy of my model is worth at least a few hundred thousand dollars, and even possibly a few million, then you are ready to begin pretraining your own model.

Now that we have learned about foundation models, how they come about through a process called pretraining, and how to adapt them to a specific domain through fine-tuning, let’s learn more about the Transformer model architecture.

The Transformer model architecture and self-attention

The Transformer model, presented in the now-famous 2017 paper Attention is all you need, marked a turning point for the machine learning industry. This is primarily because it used an existing mathematical technique, self-attention, to solve problems in NLP related to sequences. The Transformer certainly wasn’t the first attempt at modeling sequences, previously, recurrent neural networks (RNNs) and even convolutional neural networks (CNNs) were popular in language.

However, the Transformer made headlines because its training cost was a small fraction of the existing techniques. This is because the Transformer is fundamentally easier to parallelize, due to its core self-attention process, than previous techniques. It also set new world records in machine translation. The original Transformer used both an encoder and decoder, techniques we will dive into later throughout this chapter. This joint encoder-decoder pattern was followed directly by other models focused on similar text-to-text tasks, such as T5.

In 2018, Alex Radford and his team presented Generative Pretrained Transformers, a method inspired by the 2017 Transformer, but using only the decoder. Called GPT, this model handled large-scale unsupervised pretraining well, and it was paired with supervised fine-tuning to perform well on downstream tasks. As we mentioned previously, this causal language modeling technique optimizes the log probability of tokens, giving us a left-to-right ability to find the most probable word in a sequence.

In 2019, Jacob Devlin and his team presented BERT: Pretraining of Deep Bidirectional Transformers. BERT also adopted the pretraining, fine-tuning paradigm, but implemented a masked language modeling loss function that helped the model learn the impact of tokens both before and after them. This proved useful in disambiguating the meaning of words in different contexts and has aided encoder-only tasks such as classification ever since.

Despite their names, neither GPT nor BERT uses the full encoder-decoder as presented in the original Transformer paper but instead leverages the self-attention mechanism as core steps throughout the learning process. Thus, it is in fact the self-attention process we should understand.

First, remember that each word, or token, is represented as an embedding. This embedding is created simply by using a tokenizer, a pretrained data object for each model that maps the word to its appropriate dense vector. Once we have the embedding per token, we use learnable weights to generate three new vectors: key, query, and value. We then use matrix multiplication and a few steps to interact with the key and the query, using the value at the very end to determine what was most informative in the sequence overall. Throughout the training loop, we update these weights to get better and better interactions, as determined by your pretraining objective.

Your pretraining objective serves as a directional guide for how to update the model parameters. Said another way, your pretraining objective provides the primary signal to your stochastic gradient descent updating procedure, changing the weights of your model based on how incorrect your model predictions are. When you train for long periods of time, the parameters should reflect a decrease in loss, giving you an overall increase in accuracy.

Interestingly, the type of transformer heads will change slightly based on the different types of pretraining objectives you’re using. For example, a normal self-attention block uses information from both the left- and right-hand sides of a token to predict it. This is to provide the most informative contextual information for the prediction and is useful in masked language modeling. In practice, the self-attention heads are stacked to operate on full matrices of embeddings, giving us multi-head attention. Casual language modeling, however, uses a different type of attention head: masked self-attention. This limits the scope of predictive information to only the left-hand side of the matrix, forcing the model to learn a left-to-right procedure. This is in contrast to the more traditional self-attention, which has access to both the left and right sides of the sequence to make predictions.

Most of the time, in practice, and certainly throughout this book, you won’t need to code any transformers or self-attention heads from scratch. Through this book, we will, however, be diving into many model architectures, so it’s helpful to have this conceptual knowledge as a base.

From an intuitive perspective, what you’ll need to understand about transformers and self-attention is fewfold:

  • The transformer itself is a model entirely built upon a self-attention function: The self-attention function takes a set of inputs, such as embeddings, and performs mathematical operations to combine these. When combined with token (word or subword) masking, the model can effectively learn how significant certain parts of the embeddings, or the sequence, are to the other parts. This is the meaning of self-attention; the model is trying to understand which parts of the input dataset are most relevant to the other parts.
  • Transformers perform exceedingly well using sequences: Most of the benchmarks they’ve blown past in recent years are from NLP, for a good reason. The pretraining objectives for these include token masking and sequence completion, both of which rely on not just individual data points but the stringing of them together, and their combination. This is good news for those of you who already work with sequential data and an interesting challenge for those who don’t.
  • Transformers operate very well at large scales: The underlying attention head is easily parallelizable, which gives it a strong leg-up in reference to other candidate sequence-based neural network architectures such as RNNs, including Long Short-Term Memory (LSTM) based networks. The self-attention head can be set to trainable in the case of pretraining, or untrainable in the case of fine-tuning. When attempting to actually train the self-attention heads, as we’ll do throughout this book, the best performance you’ll see is when the transformers are applied on large datasets. How large they need to be, and what trade-offs you can make when electing to fine-tune or pretrain, is the subject of future chapters.

Transformers are not the only means of pretraining. As we’ll see throughout the next section, there are many different types of models, particularly in vision and multimodal cases, which can deliver state-of-the-art performance.

State-of-the-art vision and language models

If you’re new to machine learning, then there is a key concept you will eventually want to learn how to master, that is, state of the art. As you are aware, there are many different types of machine learning tasks, such as object detection, semantic segmentation, pose detection, text classification, and question answering. For each of these, there are many different research datasets. Each of these datasets provides labels, frequently for train, test, and validation splits. The datasets tend to be hosted by academic institutions, and each of these is purpose-built to train machine learning models that solve each of these types of problems.

When releasing a new dataset, researchers will frequently also release a new model that has been trained on the train set, tuned on the validation set, and separately evaluated on the test set. Their evaluation score on a new test set establishes a new state of the art for this specific type of modeling problem. When publishing certain types of papers, researchers will frequently try to improve performance in this area – for example, by trying to increase accuracy by a few percentage points on a handful of datasets.

The reason state-of-the-art performance matters for you is that it is a strong indication of how well your model is likely to perform in the best possible scenario. It isn’t easy to replicate most research results, and frequently, labs will have developed special techniques to improve performance that may not be easily observed and replicated by others. This is especially true when datasets and code repositories aren’t shared publicly, as is the case with GPT-3. This is acutely true when training methods aren’t disclosed, as with GPT-4.

However, given sufficient resources, it is possible to achieve similar performance as reported in top papers. An excellent place to find state-of-the-art performance at any given point in time is an excellent website, Papers With Code, maintained by Meta and enhanced by the community. By using this free tool, you can easily find top papers, datasets, models, and GitHub sites with example code. Additionally, they have great historical views, so you can see how the top models in different datasets have evolved over time.

In later chapters on preparing datasets and picking models, we’ll go into more detail on how to find the right examples for you, including how to determine how similar to and different from your own goals they are. Later in the book, we’ll also help you determine the optimal models, and sizes for them. Right now, let’s look at some models that, as of this writing, are currently sitting at the top of their respective leaderboards.

Top vision models as of April 2023

First, let’s take a quick look at the models performing the best today within image tasks such as classification and generation.

Dataset

Best model

From Transformer

Performance

ImageNet

Basic-L (Lion fine-tuned)

Yes

91.10% top 1% accuracy

CIFAR-10

ViT-H/14 (1)

Yes

99.5% correct

COCO

InternImage-H (M3I Pre-training: https://paperswithcode.com/paper/internimage-exploring-large-scale-vision)

No

65.0 Box AP

STL-10

Diffusion ProjectedGAN

No

6.91 FID (generation)

ObjectNet

CoCa

Yes

82.7% top 1% accuracy

MNIST

Heterogeneous ensemble with simple CNN (1)

No

99.91% accuracy (0.09% error)

Table 1.1 – Top image results

At first glance, these numbers may seem intimidating. After all, many of them are near or close to 99% accurate! Isn’t that too high of a bar for beginning or intermediate machine learning practitioners?

Before we get too carried away with doubt and fear, it’s helpful to understand that most of these accuracy scores came at least five years after the research dataset was published. If we analyze the historical graphs available on Paper With Code, it’s easy to see that when the first researchers published their datasets, initial accuracy scores were closer to 60%. Then, it took many years of hard work, across diverse organizations and teams, to finally produce models capable of hitting the 90s. So, don’t lose heart! If you put in the time, you too can train a model that establishes a new state-of-the-art performance in a given area. This part is science, not magic.

You’ll notice that while some of these models do in fact adopt a Transformer-inspired backend, some do not. Upon closer inspection, you’ll also see that some of these models rely on the pretrain and fine-tune paradigm we’ll be learning about in this book, but not all of them. If you’re new to machine learning, then this discrepancy is something to start getting comfortable with! Robust and diverse scientific debate, perspectives, insights, and observations are critical aspects of maintaining healthy communities and increasing the quality of outcomes across the field as a whole. This means that you can, and should, expect some divergence in methods you come across, and that’s a good thing.

Now that you have a better understanding of top models in computer vision these days, let’s explore one of the earliest methods combining techniques from large language models with vision: contrastive pretraining and natural language supervision.

Contrastive pretraining and natural language supervision

What’s interesting about both modern and classic image datasets, from Fei-Fei Li’s 2006 ImageNet to the LAION-5B as used in 2022 Stable Diffusion, is that the labels themselves are composed of natural language. Said another way, because the scope of the images includes objects from the physical world, the labels necessarily are more nuanced than single digits. Broadly speaking, this type of problem framing is called natural language supervision.

Imagine having a large dataset of tens of millions of images, each provided with captions. Beyond simply naming the objects, a caption gives you more information about the content of the images. A caption can be anything from Stella sits on a yellow couch to Pepper, the Australian pup. In just a few words we immediately get more context than simply describing the objects. Now, imagine using a pretrained model, such as an encoder, to process the language into a dense vector representation. Then, combine this with another pretrained model, this time an image encoder, to process the image into another dense vector representation. Combine both of these in a learnable matrix, and you are on your way to contrastive pretraining! Also presented by Alex Radford and the team, just a few years after their work on GPT, this method gives us both a way to jointly learn about the relationship between both images and language and a model well suited to do so. The model is called Contrastive Language-Image Pretraining (CLIP).

CLIP certainly isn’t the only vision-language pretraining task that uses natural language supervision. One year earlier, in 2019, a research team from China proposed a Visual-Linguistic BERT model attempting a similar goal. Since then, the joint training of vision-and-language foundation models has become very popular, with Flamingo, Imagen, and Stable Diffusion all presenting interesting work.

Now that we’ve learned a little bit about joint vision-and-language contrastive pretraining, let’s explore today’s top models in language.

Top language models as of April 2023

Now, let’s evaluate some of today’s best-in-class models for a task extremely pertinent to foundation models, and thus this book: language modeling. This table shows a set of language model benchmark results across a variety of scenarios.

Dataset

Best model

From Transformer

Performance

WikiText-103

Hybrid H3 (2.7B params)

No

10.60 test perplexity

Penn Treebank (Word Level)

GPT-3 (Zero-Shot) (1)

Yes

20.5 test perplexity

LAMBADA

PaLM-540B (Few-Shot) (1)

Yes

89.7% accuracy

Penn Treebank (Character Level)

Mogrifer LSTM + dynamic eval (1)

No

1.083 bit per character

C4 (Colossal Clean Crawled Corpus)

Primer

No

12.35 perplexity

Table 1.2 – Top language modeling results

First, let’s try to answer a fundamental question. What is language modeling, and why does it matter? Language modeling as known today appears to have been formalized in two cornerstone papers: BERT (9) and GPT (10). The core concept that inspired both papers is deceptively simple: how do we better use unsupervised natural language?

As is no doubt unsurprising to you, the vast majority of natural language in our world has no direct digital label. Some natural language lends itself well to concrete labels, such as cases where objectivity is beyond doubt. This can include accuracy in answering questions, summarization, high-level sentiment analysis, document retrieval, and more.

But the process of finding these labels and producing the datasets necessary for them can be prohibitive, as it is entirely manual. At the same time, many unsupervised datasets get larger by the minute. Now that much of the global dialog is online, datasets rich in variety are easy to access. So, how can ML researchers position themselves to benefit from these large, unsupervised datasets?

This is exactly the problem that language modeling seeks to solve. Language modeling is a process to apply mathematical techniques on large corpora of unlabelled text, relying on a variety of pretraining objectives to enable the model to teach itself about the text. Also called self-supervision, the precise method of learning varies based on the model at hand. BERT applies a mask randomly throughout the dataset and learns to predict the word hidden by the mask, using an encoder. GPT uses a decoder to predict left-to-right, starting at the beginning of a sentence, for example, and learning how to predict the end of the sentence. Models in the T5 family use both encoders and decoders to learn text-to-text tasks, such as translation and search. As proposed in ELECTRA (11), another alternative is a token replacement objective, which opts to inject new tokens into the original text, rather than masking them.

Fundamentals – fine-tuning

Foundational language models are only useful in applications when paired with their peer method, fine-tuning. The intuition behind fine-tuning is very understandable; we want to take a foundational model pretrained elsewhere and apply a much smaller set of data to make it more focused and useful for our specific task. We can also call this domain adaptation – adapting a pretrained model to an entirely different domain that was not included in its pretraining task.

Fine-tuning tasks are everywhere! You can take a base language model, such as BERT, and fine-tune it for text classification. Or question answering. Or named entity recognition. Or you could take a different model, GPT-2 for example, and fine-tune it for summarization. Or you could take something like T5 and fine-tune it for translation. The basic idea is that you are leveraging the intelligence of the foundation model. You’re leveraging the compute, the dataset, the large neural network, and ultimately, the distribution method the researchers leveraged simply by inheriting their pretrained artifact. Then, you can optionally add extra layers to the network yourself, or more likely, use a software framework such as Hugging Face to simplify the process. Hugging Face has done an amazing job building an extremely popular open source framework with tens of thousands of pretrained models, and we’ll see in future chapters how to best utilize their examples to build our own models in both vision and language. There are many different types of fine-tuning, from parameter-efficient fine-tuning to instruction-fine-tuning, chain of thought, and even methods that don’t strictly update the core model parameters such as retrieval augmented generation. We’ll discuss these later in the book.

As we will discover in future chapters, foundational language and vision models are not without their negative aspects. For starters, their extremely large compute requirements place significant energy demands on service providers. Ensuring that energy is met through sustainable means and that the modeling process is as efficient as possible are top goals for the models of the future. These large compute requirements are also obviously quite expensive, posing inherent challenges for those without sufficient resources. I would argue, however, that the core techniques you’ll learn throughout this book are relevant across a wide spectrum of computational needs and resourcing. Once you’ve demonstrated success at a smaller scale of pretraining, it’s usually much easier to justify the additional ask.

Additionally, as we will see in future chapters, large models are infamous for their ability to inherit social biases present in their training data. From associating certain employment aspects with gender to classifying criminal likelihood based on race, researchers have identified hundreds (9) of ways bias can creep into NLP systems. As with all technology, designers and developers must be aware of these risks and take steps to mitigate them. In later chapters, I’ll identify a variety of steps you can take today to reduce these risks.

Next, let’s learn about a core technique used in defining appropriate experiments for language models: the scaling laws!

Language technique spotlight – causal modeling and the scaling laws

You’ve no doubt heard of the now-infamous model ChatGPT. For a few years, a San Francisco-based AI firm, OpenAI, developed research with a mission to improve humanity’s outcomes around artificial intelligence. Toward that end, they made bold leaps in scaling language models, deriving formulas as one might in physics to explain the performance of LLMs at scale. They originally positioned themselves as a non-profit, releasing their core insights and the code to reproduce them. Four years after its founding, however, they pivoted to cutting exclusive billion-dollar deals with Microsoft. Now, their 600-strong R&D teams focus on developing proprietary models and techniques, and many open source projects attempt to replicate and improve on their offerings. Despite this controversial pivot, the team at OpenAI gave the industry a few extremely useful insights. The first is GPT, and the second is the scaling laws.

As mentioned previously, GPT-based models use causal language modeling to learn how best to complete text. This means using a left-to-right completion learning criteria, which updates the model’s learnable parameters until the text is completely accurate. While the first GPT model of 2018 was itself useful, the real excitement came years later in two phases. First, Jared Kaplan lead a team at OpenAI to suggest a novel concept: using formulas inspired by his work in physics to estimate the impact the size of the model, dataset, and overall compute environment will have on the loss of the model. These Scaling Laws for Neural Language Models (9) suggested that the optimal model size for a given compute environment was massive.

The original GPT model of 2018 was only 117 million parameters, and its second version, aptly named GPT-2, increased the model size by up to 10x. This increase in parameter size more than doubled the overall accuracy of the model. Encouraged by these results, and fuelled by Kaplan’s theoretical and empirical findings, OpenAI boldly increased the model parameter size by another 10x, giving us GPT-3.

As the model increased in size, from 1.3 billion parameters to 13 billion, ultimately hitting 175 billion parameters, accuracy also took a huge leap! This result catalyzed the field of NLP, unleashing new use cases and a flurry of new work exploring and extending these impacts. Since then, new work has explored both larger (PaLM (9)) and smaller (Chinchilla (10)) models, with Chinchilla presenting an update to the scaling laws entirely. Yann LeCunn’s team at Meta has also presented smaller models that outperform the larger ones in specific areas, such as question-answering (Atlas (9)). Amazon has also presented two models that outperform GPT-3: the AlexaTM and MM-COT. Numerous teams have also undertaken efforts to produce open source versions of GPT-3, such as Hugging Face’s BLOOM, EleutherAI’S GPT-J, and Meta’s OPT.

The rest of this book is dedicated to discussing these models – where they come from, what they are good for, and especially how to train your own! While much excellent work has covered using these pretrained models in production through fine-tuning, such as Hugging Face’s own Natural Language Processing with Transformers (Tunstall et al., 2022), I continue to believe that pretraining your own foundation model is probably the most interesting computational intellectual exercise you can embark on today. I also believe it’s one of the most profitable. But more on that ahead!

Next, let’s learn about two key model components you’ll need to understand in detail: encoders and decoders.

Encoders and decoders

Now, I’d like to briefly introduce you to two key topics that you’ll see in the discussion of transformer-based models: encoders and decoders. Let’s establish some basic intuition to help you understand what they are all about. An encoder is simply a computational graph (or neural network, function, or object depending on your background), which takes an input with a larger feature space and returns an object with a smaller feature space. We hope (and demonstrate computationally) that the encoder is able to learn what is most essential about the provided input data.

Typically, in large language and vision models, the encoder itself is composed of a number of multi-head self-attention objects. This means that in transformer-based models, an encoder is usually a number of self-attention steps, learning what is most essential about the provided input data and passing this onto the downstream model. Let’s look at a quick visual:

Figure 1.1 – Encoders and decoders

Figure 1.1 – Encoders and decoders

Intuitively, as you can see in the preceding figure, the encoder starts with a larger input space and iteratively compresses this to a smaller latent space. In the case of classification, this is just a classification head with output allotted for each class. In the case of masked language modeling, encoders are stacked on top of each other to better predict tokens to replace the masks. This means the encoders output an embedding, or a numerical representation of that token, and after prediction, the tokenizer is reused to translate that embedding back into natural language.

One of the earliest large language models, BERT, is an encoder-only model. Most other BERT-based models, such as DeBERTa, DistiliBERT, RoBERTa, DeBERTa, and others in this family use encoder-only model architectures. Decoders operate exactly in reverse, starting with a compressed representation and iteratively recomposing that back into a larger feature space. Both encoders and decoders can be combined, as in the original Transformer, to solve text-to-text problems.

To make it easier, here’s a short table quickly summarizing the three types of self-attention blocks we’ve looked at, encoders, decoders, and their combination.

Size of inputs and outputs

Type of self-attention blocks

Machine learning tasks

Example models

Long to short

Encoder

Classification, any dense representation

BERT, DeBERTa, DistiliBERT, RoBERTa, XLM, AlBERT, CLIP, VL-BERT, Vision Transformer

Short to long

Decoder

Generation, summarization, question-answering, any sparse representation

GPT, GPT-2, GPT-Neo, GPT-J, ChatGPT, GPT-4, BLOOM, OPT

Equal

Encoder-decoder

Machine translation, style translation

T5, BART, BigBird, FLAN-T5, Stable Diffusion

Table 1.3 – Encoders, decoders, and their combination

Now that you have a better understanding of encoders, decoders, and the models they create, let’s close out the chapter with a quick recap of all the concepts you just learned about.

Summary

We’ve covered a lot in just this first chapter! Let’s quickly recap some of the top themes before moving on. First, we looked at the art of pretraining and fine-tuning, including a few key pretraining objects such as masked language and causal language modeling. We learned about the Transformer model architecture, including the core self-attention mechanism with its variant. We looked at state-of-the-art vision and language models, including spotlights on contrastive pretraining from natural language supervision, and scaling laws for neural language models. We learned about encoders, decoders, and their combination, which are useful throughout the vision and language domains today.

Now that you have a great conceptual and applied basis to understand pretraining foundation models, let’s look at preparing your dataset: part one.

References

Please go through the following content for more information on a few topics covered in the chapter:

  1. The Bitter Lesson, Rich Sutton, March 13, 2019: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
  2. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://aclanthology.org/N19-1423/. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  3. Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Volume 33. Pages 1877-1901. Curran Associates, Inc.
  4. AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE: https://arxiv.org/pdf/2010.11929v2.pdf
  5. AN ENSEMBLE OF SIMPLE CONVOLUTIONAL NEURAL NETWORK MODELS FOR MNIST DIGIT RECOGNITION: https://arxiv.org/pdf/2008.10400v2.pdf
  6. Language Models are Few-Shot Learners: https://arxiv.org/pdf/2005.14165v4.pdf
  7. PaLM: Scaling Language Modeling with Pathways: https://arxiv.org/pdf/2204.02311v3.pdf
  8. MOGRIFIER LSTM: https://arxiv.org/pdf/1909.01792v2.pdf
  9. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/pdf/1810.04805.pdf
  10. Improving Language Understanding by Generative Pre-Training: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
  11. ELECTRA: PRE-TRAINING TEXT ENCODERS AS DISCRIMINATORS RATHER THAN GENERATORS: https://arxiv.org/pdf/2003.10555.pdf
  12. Language (Technology) is Power: A Critical Survey of “Bias” in NLP: https://arxiv.org/pdf/2005.14050.pdf
  13. Scaling Laws for Neural Language Models: https://arxiv.org/pdf/2001.08361.pdf
  14. PaLM: Scaling Language Modeling with Pathways: https://arxiv.org/pdf/2204.02311.pdf
  15. Training Compute-Optimal Large Language Models: https://arxiv.org/pdf/2203.15556.pdf
  16. Atlas: Few-shot Learning with Retrieval Augmented Language Models: https://arxiv.org/pdf/2208.03299.pdf
Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Learn to develop, train, tune, and apply foundation models with optimized end-to-end pipelines
  • Explore large-scale distributed training for models and datasets with AWS and SageMaker examples
  • Evaluate, deploy, and operationalize your custom models with bias detection and pipeline monitoring

Description

Foundation models have forever changed machine learning. From BERT to ChatGPT, CLIP to Stable Diffusion, when billions of parameters are combined with large datasets and hundreds to thousands of GPUs, the result is nothing short of record-breaking. The recommendations, advice, and code samples in this book will help you pretrain and fine-tune your own foundation models from scratch on AWS and Amazon SageMaker, while applying them to hundreds of use cases across your organization. With advice from seasoned AWS and machine learning expert Emily Webber, this book helps you learn everything you need to go from project ideation to dataset preparation, training, evaluation, and deployment for large language, vision, and multimodal models. With step-by-step explanations of essential concepts and practical examples, you’ll go from mastering the concept of pretraining to preparing your dataset and model, configuring your environment, training, fine-tuning, evaluating, deploying, and optimizing your foundation models. You will learn how to apply the scaling laws to distributing your model and dataset over multiple GPUs, remove bias, achieve high throughput, and build deployment pipelines. By the end of this book, you’ll be well equipped to embark on your own project to pretrain and fine-tune the foundation models of the future.

Who is this book for?

If you’re a machine learning researcher or enthusiast who wants to start a foundation modelling project, this book is for you. Applied scientists, data scientists, machine learning engineers, solution architects, product managers, and students will all benefit from this book. Intermediate Python is a must, along with introductory concepts of cloud computing. A strong understanding of deep learning fundamentals is needed, while advanced topics will be explained. The content covers advanced machine learning and cloud techniques, explaining them in an actionable, easy-to-understand way.

What you will learn

  • Find the right use cases and datasets for pretraining and fine-tuning
  • Prepare for large-scale training with custom accelerators and GPUs
  • Configure environments on AWS and SageMaker to maximize performance
  • Select hyperparameters based on your model and constraints
  • Distribute your model and dataset using many types of parallelism
  • Avoid pitfalls with job restarts, intermittent health checks, and more
  • Evaluate your model with quantitative and qualitative insights
  • Deploy your models with runtime improvements and monitoring pipelines

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : May 31, 2023
Length: 258 pages
Edition : 1st
Language : English
ISBN-13 : 9781804618257
Category :
Languages :
Concepts :
Tools :

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details

Publication date : May 31, 2023
Length: 258 pages
Edition : 1st
Language : English
ISBN-13 : 9781804618257
Category :
Languages :
Concepts :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 144.97
Building AI Applications with ChatGPT APIs
$44.99
Modern Generative AI with ChatGPT and OpenAI Models
$49.99
Pretrain Vision and Large Language Models in Python
$49.99
Total $ 144.97 Stars icon
Banner background image

Table of Contents

22 Chapters
Part 1: Before Pretraining Chevron down icon Chevron up icon
Chapter 1: An Introduction to Pretraining Foundation Models Chevron down icon Chevron up icon
Chapter 2: Dataset Preparation: Part One Chevron down icon Chevron up icon
Chapter 3: Model Preparation Chevron down icon Chevron up icon
Part 2: Configure Your Environment Chevron down icon Chevron up icon
Chapter 4: Containers and Accelerators on the Cloud Chevron down icon Chevron up icon
Chapter 5: Distribution Fundamentals Chevron down icon Chevron up icon
Chapter 6: Dataset Preparation: Part Two, the Data Loader Chevron down icon Chevron up icon
Part 3: Train Your Model Chevron down icon Chevron up icon
Chapter 7: Finding the Right Hyperparameters Chevron down icon Chevron up icon
Chapter 8: Large-Scale Training on SageMaker Chevron down icon Chevron up icon
Chapter 9: Advanced Training Concepts Chevron down icon Chevron up icon
Part 4: Evaluate Your Model Chevron down icon Chevron up icon
Chapter 10: Fine-Tuning and Evaluating Chevron down icon Chevron up icon
Chapter 11: Detecting, Mitigating, and Monitoring Bias Chevron down icon Chevron up icon
Chapter 12: How to Deploy Your Model Chevron down icon Chevron up icon
Part 5: Deploy Your Model Chevron down icon Chevron up icon
Chapter 13: Prompt Engineering Chevron down icon Chevron up icon
Chapter 14: MLOps for Vision and Language Chevron down icon Chevron up icon
Chapter 15: Future Trends in Pretraining Foundation Models Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.4
(22 Ratings)
5 star 68.2%
4 star 18.2%
3 star 4.5%
2 star 4.5%
1 star 4.5%
Filter icon Filter
Top Reviews

Filter reviews by




N/A Jan 19, 2024
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Feefo Verified review Feefo
alla Jul 07, 2023
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This compelling guide is a treasure for machine learning engineers aspiring to dive deep into working within the AWS ecosystem.It proficiently lays out the practices of model algorithms and development, from selecting the ideal design and training dataset to deploying on AWS.With the author's expertise and insightful predictions, readers are navigated through the wonders of Sagemaker, model evaluation, and future trends.Highly recommended for those ready to embark on this awesome journey.
Amazon Verified review Amazon
Steven Fernandes Aug 11, 2023
Full star icon Full star icon Full star icon Full star icon Full star icon 5
In this comprehensive guide, readers are taken on a journey through the intricacies of machine learning model optimization, starting with the selection of ideal use cases and datasets for pretraining and fine-tuning. The book stands out in its detailed discussions on harnessing the power of custom accelerators and GPUs for large-scale training. It provides a hands-on tutorial on configuring AWS and SageMaker environments for optimal performance, which many practitioners will find invaluable. One of the key highlights is the chapter on hyperparameter selection, where the relationship between model constraints and optimal settings is delved into with clarity. The sections on model and dataset distribution using various parallelism techniques offer insights into efficient data handling. Practical advice, such as how to navigate challenges like job restarts and health checks, ensures readers are well-equipped to handle real-world scenarios. Finally, the guide culminates with thorough methods to evaluate and deploy models, emphasizing both quantitative and qualitative insights, and the importance of effective runtime improvements and monitoring. A must-read for anyone keen on mastering the art of machine learning at scale.
Amazon Verified review Amazon
Deep P. Jul 31, 2023
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Pretrain Vision and Large Language Models in Python is a book by Emily Webber that teaches you how to train and deploy foundation models on AWS. The book covers a wide range of topics, from the basics of pretraining to the deployment of models on SageMaker.The book is well-written and easy to follow, and it includes a number of helpful resources, such as code samples, tutorials, and links to additional information. The author does a great job of explaining complex concepts in a clear and concise way, and she provides plenty of examples to help readers understand how to train and deploy foundation models on AWS.One of the things I really liked about this book is that it goes beyond just explaining the concepts. The author also provides guidance on how to use these concepts to solve real-world problems. For example, she shows how to use foundation models to build applications that can automatically generate text, translate languages, and answer questions.Overall, I thought Pretrain Vision and Large Language Models in Python was an excellent resource for anyone who wants to learn more about foundation models or who wants to use these models to build their own applications. The book is well-written, informative, and easy to follow. I highly recommend it.
Amazon Verified review Amazon
Om S Jun 05, 2023
Full star icon Full star icon Full star icon Full star icon Full star icon 5
As a long-time admirer of Emily Webber and an enthusiast of SageMaker and Large Language Models (LLMs), I've been eagerly anticipating the release of this book. As a regular user of AWS, I can confidently state that this book is a substantial contribution to the advanced field of Machine Learning (ML).In the fast-paced world of ML, where LLMs are at the forefront of innovation, keeping up with the rapid advancements can be challenging. However, Emily's book, announced just before the LLM boom, provides a solid foundation that enables you to navigate these developments with ease. It has proven to be a timely resource for those keen on understanding and leveraging the power of LLMs.Book Summary:Comprehensive Coverage: The book offers an in-depth exploration of training vision and large language models, covering all stages from project ideation, dataset preparation, training, evaluation, to deployment for large language, vision, and multimodal models.Expert Guidance: Authored by Emily Webber, a seasoned AWS and machine learning expert, the book provides industry-expert guidance and practical advice, making it a valuable resource for both beginners and experienced practitioners.Practical Approach: The book is replete with practical examples and code samples that help readers understand how to pretrain and fine-tune their own foundation models on AWS and Amazon SageMaker.Bias Detection: A unique feature of the book is its focus on bias detection and pipeline monitoring, which are critical aspects of model development and deployment.Advanced Topics: The book delves into advanced topics like large-scale distributed training, hyperparameter selection, and model distribution, providing readers with a deep understanding of these complex areas.Future Trends: The final chapter on future trends in pretraining foundation models gives readers a glimpse into what's next in the field, keeping them ahead of the curve.In conclusion, if you're looking to ride the wave of LLMs and want to do so using AWS, this book is a must-read. It's more than just a guide; not a beginner's book!!! It's a comprehensive resource that empowers you to navigate the fast-paced world of ML with confidence and proficiency. Emily Webber's expertise shines through each page, making this book an invaluable asset for anyone in the field. As LLMs continue to evolve and revolutionize various sectors, this book stands as a testament to their transformative potential and a guide for those looking to be part of this exciting journey. This is just the start..... Transformers......... !!!
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is included in a Packt subscription? Chevron down icon Chevron up icon

A subscription provides you with full access to view all Packt and licnesed content online, this includes exclusive access to Early Access titles. Depending on the tier chosen you can also earn credits and discounts to use for owning content

How can I cancel my subscription? Chevron down icon Chevron up icon

To cancel your subscription with us simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - From here you will see the ‘cancel subscription’ button in the grey box with your subscription information in.

What are credits? Chevron down icon Chevron up icon

Credits can be earned from reading 40 section of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a Credit every month if you subscribe to our annual or 18 month plans. Credits can be used to buy books DRM free, the same way that you would pay for a book. Your credits can be found in the subscription homepage - subscription.packtpub.com - clicking on ‘the my’ library dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled? Chevron down icon Chevron up icon

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title? Chevron down icon Chevron up icon

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles? Chevron down icon Chevron up icon

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date? Chevron down icon Chevron up icon

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready? Chevron down icon Chevron up icon

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access? Chevron down icon Chevron up icon

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered? Chevron down icon Chevron up icon

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content? Chevron down icon Chevron up icon

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access? Chevron down icon Chevron up icon

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls in to place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.