If you’re new to machine learning, there is a key concept you will eventually want to master: state of the art. As you may know, there are many different types of machine learning tasks, such as object detection, semantic segmentation, pose detection, text classification, and question answering. For each of these tasks, there are many research datasets. Each dataset provides labels, frequently split into train, test, and validation sets. These datasets tend to be hosted by academic institutions, and each is purpose-built to train machine learning models that solve one of these types of problems.
When releasing a new dataset, researchers will frequently also release a new model that has been trained on the train set, tuned on the validation set, and separately evaluated on the test set. Their evaluation score on that fresh test set establishes a new state of the art for this specific type of modeling problem. When publishing certain types of papers, researchers will frequently try to improve performance on an established benchmark – for example, by trying to increase accuracy by a few percentage points on a handful of datasets.
The reason state-of-the-art performance matters for you is that it is a strong indication of how well your model is likely to perform in the best possible scenario. It isn’t easy to replicate most research results, and frequently, labs will have developed special techniques to improve performance that may not be easily observed and replicated by others. This is especially true when datasets and code repositories aren’t shared publicly, as is the case with GPT-3. This is acutely true when training methods aren’t disclosed, as with GPT-4.
However, given sufficient resources, it is possible to achieve performance similar to what’s reported in top papers. An excellent place to find state-of-the-art performance at any given point in time is Papers With Code, a free website maintained by Meta and enhanced by the community. With this tool, you can easily find top papers, datasets, models, and GitHub repositories with example code. It also offers great historical views, so you can see how the top models for different datasets have evolved over time.
In later chapters on preparing datasets and picking models, we’ll go into more detail on how to find the right examples for you, including how to judge how similar to, and different from, your own goals they are. Later in the book, we’ll also help you determine the optimal models and sizes for them. Right now, let’s look at some models that, as of this writing, are sitting at the top of their respective leaderboards.
Top vision models as of April 2023
First, let’s take a quick look at the models performing the best today within image tasks such as classification and generation.
| Dataset | Best model | From Transformer | Performance |
| --- | --- | --- | --- |
| ImageNet | Basic-L (Lion fine-tuned) | Yes | 91.10% top-1 accuracy |
| CIFAR-10 | ViT-H/14 (1) | Yes | 99.5% correct |
| COCO | InternImage-H (M3I Pre-training: https://paperswithcode.com/paper/internimage-exploring-large-scale-vision) | No | 65.0 box AP |
| STL-10 | Diffusion ProjectedGAN | No | 6.91 FID (generation) |
| ObjectNet | CoCa | Yes | 82.7% top-1 accuracy |
| MNIST | Heterogeneous ensemble with simple CNN (1) | No | 99.91% accuracy (0.09% error) |
Table 1.1 – Top image results
At first glance, these numbers may seem intimidating. After all, many of them are at or near 99% accuracy! Isn’t that too high a bar for beginning or intermediate machine learning practitioners?
Before we get too carried away with doubt and fear, it’s helpful to understand that most of these accuracy scores came at least five years after the research dataset was published. If we analyze the historical graphs available on Papers With Code, it’s easy to see that when the first researchers published their datasets, initial accuracy scores were closer to 60%. It then took many years of hard work, across diverse organizations and teams, to finally produce models capable of hitting the 90s. So, don’t lose heart! If you put in the time, you too can train a model that sets a new state of the art in a given area. This part is science, not magic.
You’ll notice that while some of these models adopt a Transformer-inspired backbone, some do not. Upon closer inspection, you’ll also see that some of these models rely on the pretrain and fine-tune paradigm we’ll be learning about in this book, but not all of them. If you’re new to machine learning, this discrepancy is something to start getting comfortable with! Robust and diverse scientific debate, perspectives, insights, and observations are critical to maintaining healthy communities and increasing the quality of outcomes across the field as a whole. This means you can, and should, expect some divergence in the methods you come across, and that’s a good thing.
Now that you have a better understanding of top models in computer vision these days, let’s explore one of the earliest methods combining techniques from large language models with vision: contrastive pretraining and natural language supervision.
Contrastive pretraining and natural language supervision
What’s interesting about both modern and classic image datasets, from Fei-Fei Li’s 2006 ImageNet to the LAION-5B dataset used to train Stable Diffusion in 2022, is that the labels themselves are composed of natural language. Said another way, because the scope of the images includes objects from the physical world, the labels are necessarily more nuanced than single class IDs. Broadly speaking, this type of problem framing is called natural language supervision.
Imagine having a large dataset of tens of millions of images, each provided with a caption. Beyond simply naming the objects, a caption gives you more information about the content of the image. A caption can be anything from Stella sits on a yellow couch to Pepper, the Australian pup. In just a few words, we immediately get more context than a bare list of objects. Now, imagine using a pretrained model, such as an encoder, to process the language into a dense vector representation. Then, combine this with another pretrained model, this time an image encoder, to process the image into another dense vector representation. Combine the two with a learnable projection, and you are on your way to contrastive pretraining! Presented by Alec Radford and his team just a few years after their work on GPT, this method gives us both a way to jointly learn the relationship between images and language and a model well suited to do so. The model is called Contrastive Language-Image Pretraining (CLIP).
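The mechanics described above fit in a few lines of code: embed each image and each caption, project both into a shared space, and score every image against every caption so that matching pairs win. Here is a minimal NumPy sketch of the symmetric contrastive loss; the "encoder outputs" and projection matrices are random placeholders standing in for real pretrained encoders, so treat this as an illustration of the objective, not CLIP itself:

```python
import numpy as np

rng = np.random.default_rng(0)

batch = 4                       # number of (image, caption) pairs
d_img, d_txt, d_shared = 512, 256, 128

# Placeholder encoder outputs -- in CLIP these would come from a
# vision encoder and a text encoder; here they are random vectors.
img_features = rng.normal(size=(batch, d_img))
txt_features = rng.normal(size=(batch, d_txt))

# Learnable projections map both modalities into one shared space.
W_img = rng.normal(size=(d_img, d_shared))
W_txt = rng.normal(size=(d_txt, d_shared))

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

img_emb = normalize(img_features @ W_img)
txt_emb = normalize(txt_features @ W_txt)

# Pairwise cosine similarities, scaled by a temperature parameter.
temperature = 0.07
logits = (img_emb @ txt_emb.T) / temperature

def cross_entropy(logits, targets):
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# The i-th image matches the i-th caption, so the correct "class"
# for each row (and each column) is the diagonal entry.
targets = np.arange(batch)
loss = (cross_entropy(logits, targets) + cross_entropy(logits.T, targets)) / 2
print(f"symmetric contrastive loss: {loss:.3f}")
```

Training would then update the projections (and, in CLIP, the encoders) to drive this loss down, pulling matched image–caption pairs together and pushing mismatched ones apart.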
CLIP certainly isn’t the only vision-language pretraining approach that uses natural language supervision. Earlier, in 2019, a research team from China proposed a Visual-Linguistic BERT model with a similar goal. Since then, the joint training of vision-and-language foundation models has become very popular, with Flamingo, Imagen, and Stable Diffusion all presenting interesting work.
Now that we’ve learned a little bit about joint vision-and-language contrastive pretraining, let’s explore today’s top models in language.
Top language models as of April 2023
Now, let’s evaluate some of today’s best-in-class models for a task extremely pertinent to foundation models, and thus this book: language modeling. This table shows a set of language model benchmark results across a variety of scenarios.
| Dataset | Best model | From Transformer | Performance |
| --- | --- | --- | --- |
| WikiText-103 | Hybrid H3 (2.7B params) | No | 10.60 test perplexity |
| Penn Treebank (Word Level) | GPT-3 (Zero-Shot) (1) | Yes | 20.5 test perplexity |
| LAMBADA | PaLM-540B (Few-Shot) (1) | Yes | 89.7% accuracy |
| Penn Treebank (Character Level) | Mogrifier LSTM + dynamic eval (1) | No | 1.083 bits per character |
| C4 (Colossal Clean Crawled Corpus) | Primer | No | 12.35 perplexity |
Table 1.2 – Top language modeling results
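The perplexity scores in Table 1.2 are simply the exponential of the model’s average per-token cross-entropy loss, so lower is better and a perfect model would score 1.0. A quick illustration, using made-up probabilities that a hypothetical model assigned to each correct next token:

```python
import math

# Hypothetical probabilities a model assigned to the correct next token
# at each of four positions -- illustrative numbers, not real model output.
token_probs = [0.25, 0.10, 0.50, 0.05]

# Average negative log-likelihood per token.
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is e raised to that average. Intuitively, it is the size of
# a uniform distribution the model's uncertainty is equivalent to.
perplexity = math.exp(nll)
print(f"perplexity: {perplexity:.2f}")
```

A model that assigned probability 1.0 to every correct token would get perplexity 1.0, while one that spread probability uniformly over a 50,000-word vocabulary would get perplexity 50,000, which puts the 10–20 range in the table in context.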
First, let’s try to answer a fundamental question: what is language modeling, and why does it matter? Language modeling as known today appears to have been formalized in two cornerstone papers: BERT (9) and GPT (10). The core question that inspired both papers is deceptively simple: how do we make better use of unlabelled natural language?
As you can no doubt imagine, the vast majority of natural language in our world has no direct digital label. Some natural language lends itself well to concrete labels, in cases where objectivity is beyond doubt. This can include answering questions accurately, summarization, high-level sentiment analysis, document retrieval, and more.
But the process of finding these labels and producing the datasets necessary for them can be prohibitive, as it is entirely manual. At the same time, many unsupervised datasets get larger by the minute. Now that much of the global dialog is online, datasets rich in variety are easy to access. So, how can ML researchers position themselves to benefit from these large, unsupervised datasets?
This is exactly the problem that language modeling seeks to solve. Language modeling is the process of applying mathematical techniques to large corpora of unlabelled text, relying on a variety of pretraining objectives that enable the model to teach itself about the text. Also called self-supervision, the precise method of learning varies with the model at hand. BERT randomly applies a mask throughout the dataset and, using an encoder, learns to predict the words hidden by the mask. GPT uses a decoder to predict left to right – starting at the beginning of a sentence, for example, and learning to predict its end. Models in the T5 family use both encoders and decoders to learn text-to-text tasks, such as translation and search. As proposed in ELECTRA (11), another alternative is a token-replacement objective, which injects new tokens into the original text rather than masking them.
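The difference between these objectives is easiest to see on a single sentence. The sketch below builds the (input, target) pairs that a masked objective like BERT’s and a causal objective like GPT’s would each train on; the whole-word tokenization and the fixed masking rate are simplifications for illustration:

```python
import random

random.seed(7)
tokens = "the cat sat on the mat".split()

# Masked language modeling (BERT-style): hide random tokens and
# predict them from the full surrounding context, left and right.
mask_rate = 0.33
masked_input, mlm_targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < mask_rate:
        masked_input.append("[MASK]")
        mlm_targets[i] = tok        # the model must recover this token
    else:
        masked_input.append(tok)

# Causal language modeling (GPT-style): at every position, predict
# the next token using only the tokens to its left.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print("MLM input:", masked_input)
print("first causal pair:", causal_pairs[0])  # (['the'], 'cat')
```

Either way, the labels come for free from the raw text itself, which is precisely what makes self-supervision so scalable.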
Fundamentals – fine-tuning
Foundational language models are only useful in applications when paired with their peer method, fine-tuning. The intuition behind fine-tuning is straightforward: we take a foundational model pretrained elsewhere and apply a much smaller dataset to make it more focused and useful for our specific task. We can also call this domain adaptation – adapting a pretrained model to a domain that was not included in its pretraining data.
Fine-tuning tasks are everywhere! You can take a base language model, such as BERT, and fine-tune it for text classification. Or question answering. Or named entity recognition. Or you could take a different model, GPT-2 for example, and fine-tune it for summarization. Or you could take something like T5 and fine-tune it for translation. The basic idea is that you are leveraging the intelligence of the foundation model: the compute, the dataset, the large neural network, and ultimately, the distribution method the researchers used, simply by inheriting their pretrained artifact. Then, you can optionally add extra layers to the network yourself or, more likely, use a software framework such as Hugging Face to simplify the process. Hugging Face has done an amazing job building an extremely popular open source framework with tens of thousands of pretrained models, and we’ll see in future chapters how best to utilize their examples to build our own models in both vision and language. There are many different types of fine-tuning, from parameter-efficient fine-tuning to instruction fine-tuning, chain-of-thought, and even methods that don’t strictly update the core model parameters, such as retrieval-augmented generation. We’ll discuss these later in the book.
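In practice you’d reach for Hugging Face’s tooling for all of this, but the core mechanic of the simplest form of fine-tuning fits in a toy sketch: freeze a pretrained backbone, bolt a small task head on top, and train only the new parameters on your labeled data. In the NumPy sketch below, the "backbone" is just a fixed random projection standing in for a real pretrained encoder, and the labels are synthetic ones constructed to be learnable from the frozen features, purely so the toy converges:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: a frozen feature extractor.
# In real fine-tuning, this would be BERT, GPT-2, T5, and so on.
d_in, d_feat = 20, 8
W_backbone = rng.normal(size=(d_in, d_feat))

def backbone(x):
    return np.tanh(x @ W_backbone)   # frozen: never updated below

# A tiny synthetic labeled dataset for the downstream task
# (binary classification), built to be separable in feature space.
X = rng.normal(size=(64, d_in))
true_w = rng.normal(size=d_feat)
y = (backbone(X) @ true_w > 0).astype(float)

# The new task head: the only parameters we will train here.
w_head = np.zeros(d_feat)
b_head = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

feats = backbone(X)   # computed once, since the backbone is frozen
lr = 0.5
for _ in range(500):  # plain gradient descent on binary cross-entropy
    preds = sigmoid(feats @ w_head + b_head)
    grad = preds - y
    w_head -= lr * feats.T @ grad / len(y)
    b_head -= lr * grad.mean()

accuracy = ((sigmoid(feats @ w_head + b_head) > 0.5) == (y > 0.5)).mean()
print(f"training accuracy after tuning the head: {accuracy:.2f}")
```

Real fine-tuning usually updates some or all of the backbone weights too, at a small learning rate; the freeze-everything version shown here is the cheapest point on that spectrum and is closest in spirit to the parameter-efficient methods mentioned above.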
As we will discover in future chapters, foundational language and vision models are not without their negative aspects. For starters, their extremely large compute requirements place significant energy demands on service providers. Ensuring that energy is met through sustainable means and that the modeling process is as efficient as possible are top goals for the models of the future. These large compute requirements are also obviously quite expensive, posing inherent challenges for those without sufficient resources. I would argue, however, that the core techniques you’ll learn throughout this book are relevant across a wide spectrum of computational needs and resourcing. Once you’ve demonstrated success at a smaller scale of pretraining, it’s usually much easier to justify the additional ask.
Additionally, as we will see in future chapters, large models are infamous for their ability to inherit social biases present in their training data. From associating certain employment aspects with gender to classifying criminal likelihood based on race, researchers have identified hundreds (9) of ways bias can creep into NLP systems. As with all technology, designers and developers must be aware of these risks and take steps to mitigate them. In later chapters, I’ll identify a variety of steps you can take today to reduce these risks.
Next, let’s learn about a core technique used in defining appropriate experiments for language models: the scaling laws!
Language technique spotlight – causal modeling and the scaling laws
You’ve no doubt heard of the now-famous model ChatGPT. For years, OpenAI, a San Francisco-based AI firm, pursued research with a mission to improve humanity’s outcomes around artificial intelligence. Toward that end, they made bold leaps in scaling language models, deriving formulas, as one might in physics, to explain the performance of LLMs at scale. They originally positioned themselves as a non-profit, releasing their core insights and the code to reproduce them. Four years after its founding, however, the company pivoted to cutting exclusive billion-dollar deals with Microsoft. Now, their 600-strong R&D teams focus on developing proprietary models and techniques, while many open source projects attempt to replicate and improve on their offerings. Despite this controversial pivot, the team at OpenAI gave the industry a few extremely useful insights. The first is GPT, and the second is the scaling laws.
As mentioned previously, GPT-based models use causal language modeling to learn how best to complete text. This means using a left-to-right completion criterion, updating the model’s learnable parameters to make its predicted completions as accurate as possible. While the first GPT model of 2018 was itself useful, the real excitement came years later in two phases. First, Jared Kaplan led a team at OpenAI to suggest a novel concept: using formulas inspired by his work in physics to estimate the impact the size of the model, the dataset, and the overall compute environment will have on the loss of the model. These Scaling Laws for Neural Language Models (9) suggested that the optimal model size for a given compute budget was massive.
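The headline result of that paper can be written down directly: test loss falls as a power law in the parameter count N, with analogous laws for dataset size and compute. The constants below are the approximate values reported by Kaplan et al. for non-embedding parameters; treat them as illustrative rather than exact:

```python
# Approximate power law from Scaling Laws for Neural Language Models:
# L(N) = (N_c / N) ** alpha_N, where N is non-embedding parameter count.
N_C = 8.8e13      # reported critical scale, in parameters (approximate)
ALPHA_N = 0.076   # reported exponent (approximate)

def loss_from_params(n_params: float) -> float:
    """Predicted test loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

# Roughly the GPT-1 -> GPT-2 -> GPT-3 size progression.
for n in [117e6, 1.5e9, 175e9]:
    print(f"{n:9.0e} params -> predicted loss {loss_from_params(n):.2f}")
```

The striking property is the smooth, predictable decline: each 10x in parameters buys a similar multiplicative drop in loss, which is what justified the leaps in scale described next.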
The original GPT model of 2018 was only 117 million parameters, and its second version, aptly named GPT-2, increased the model size by up to 10x. This increase in parameter size more than doubled the overall accuracy of the model. Encouraged by these results, and fuelled by Kaplan’s theoretical and empirical findings, OpenAI boldly increased the model parameter size by another 10x, giving us GPT-3.
As the model increased in size, from 1.3 billion parameters to 13 billion, ultimately hitting 175 billion parameters, accuracy also took a huge leap! This result catalyzed the field of NLP, unleashing new use cases and a flurry of new work exploring and extending these impacts. Since then, new work has explored both larger (PaLM (9)) and smaller (Chinchilla (10)) models, with Chinchilla presenting an update to the scaling laws entirely. Yann LeCun’s team at Meta has also presented smaller models that outperform the larger ones in specific areas, such as question answering (Atlas (9)). Amazon has also presented two models that outperform GPT-3: AlexaTM and MM-CoT. Numerous teams have undertaken efforts to produce open source versions of GPT-3, such as Hugging Face’s BLOOM, EleutherAI’s GPT-J, and Meta’s OPT.
The rest of this book is dedicated to discussing these models – where they come from, what they are good for, and especially how to train your own! While much excellent work has covered using these pretrained models in production through fine-tuning, such as Hugging Face’s own Natural Language Processing with Transformers (Tunstall et al., 2022), I continue to believe that pretraining your own foundation model is probably the most interesting computational intellectual exercise you can embark on today. I also believe it’s one of the most profitable. But more on that ahead!
Next, let’s learn about two key model components you’ll need to understand in detail: encoders and decoders.