Introducing generative AI
In the media, there is substantial coverage of AI-related breakthroughs and their potential implications. These range from advancements in Natural Language Processing (NLP) and computer vision to the development of sophisticated language models like GPT-4. Generative models in particular have received a lot of attention due to their ability to generate text, images, and other creative content that is often indistinguishable from human-generated content. These same models also provide wide functionality, including semantic search, content manipulation, and classification. This enables cost savings through automation and lets humans apply their creativity at an unprecedented level.
Generative AI refers to algorithms that can generate novel content, as opposed to analyzing or acting on existing data like more traditional, predictive machine learning or AI systems.
Benchmarks capturing task performance in different domains have been major drivers of the development of these models. The Massive Multitask Language Understanding (MMLU) benchmark is a comprehensive suite of 57 tasks spanning diverse domains like math, history, computer science, and law. It serves as a standardized way to evaluate the multitask performance and broad capabilities of LLMs in both zero-shot and few-shot settings. The MMLU benchmark’s importance lies in providing a challenging and multifaceted test of a model’s understanding and problem-solving abilities across a wide range of topics. It allows for systematic comparisons between different LLMs and tracks progress in developing models with robust language understanding and reasoning skills beyond narrow domains.
The following graph, inspired by a blog post titled GPT-4 Predictions by Stephen McAleese on LessWrong, shows the improvements of LLMs in the benchmark:
Figure 1.1: Average performance on the MMLU benchmark of LLMs
Please note that results should be taken with a pinch of salt since they are self-reported and are obtained either by 5-shot or 0-shot conditioning. Most benchmark results come from 5-shot (indicated by an “o”). A few, like the GPT-2, PaLM, and PaLM-2 results, refer to zero-shot (“x”).
From the preceding graph, we can see significant improvements in recent years in the MMLU benchmark. In particular, it highlights the progress of the models that OpenAI provides through a public user interface, especially the improvements between releases, from GPT-2 to GPT-3 and from GPT-3.5 to GPT-4.
The graph shows the MMLU performance of models that were prompted either with the question alone (zero-shot) or with the question plus a few worked examples, typically five (few-shot). According to Measuring Massive Multitask Language Understanding (Hendrycks et al., revised in 2023), the added examples result in a boost of about 20% in the model’s performance.
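To make the difference between zero-shot and few-shot prompting concrete, here is a minimal sketch in Python. It assumes a four-option multiple-choice format; the function names, the prompt layout, and the sample questions are hypothetical illustrations, not the exact format used by the benchmark authors.

```python
from typing import List, Sequence, Tuple

# Hypothetical demonstration format: (question, options, answer letter).
Example = Tuple[str, List[str], str]

LETTERS = "ABCD"


def format_question(question: str, options: List[str]) -> str:
    """Render one multiple-choice question with lettered options."""
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def build_prompt(
    question: str,
    options: List[str],
    examples: Sequence[Example] = (),
) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt.

    Each solved example is placed before the target question, so the
    model sees the task format and some correct answers first.
    """
    parts = [f"{format_question(q, opts)} {answer}" for q, opts, answer in examples]
    parts.append(format_question(question, options))
    return "\n\n".join(parts)


# Made-up sample data for illustration only.
demos = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
zero_shot = build_prompt("What is 3 * 3?", ["6", "8", "9", "12"])
one_shot = build_prompt("What is 3 * 3?", ["6", "8", "9", "12"], demos)
print(one_shot)
```

In a 5-shot setting, `demos` would hold five solved questions from the same subject. The target question always comes last and ends with an open "Answer:" for the model to complete.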
It is difficult to definitively declare the strongest LLM among Claude 3, GPT-4, and Gemini, as their performances appear to be closely matched and vary across different tasks. Ultimately, the choice of the strongest LLM may depend on specific use cases and requirements, including their costs.
Several differences between these models and the way they are trained can account for differences in performance, such as scale, instruction tuning, tweaks to the attention mechanism, and the choice of training data. First and foremost, the massive scaling up of parameters from 1.5 billion (GPT-2) to 175 billion (GPT-3) to reportedly more than a trillion (GPT-4) enables models to learn more complex patterns. Another major change, in early 2022, was the post-training fine-tuning of models based on human instructions, which teaches the model how to perform a task by providing demonstrations and feedback.
Across benchmarks, a few models have recently started to perform better than an average human rater, but generally, they still haven’t reached the performance of a human expert. These achievements of human engineering are impressive; however, it should be noted that the performance of these models depends on the field; most models are still performing poorly on the GSM8K benchmark of grade school math word problems. As AI models like OpenAI’s GPT continue to improve, they could become indispensable assets to teams in need of diverse knowledge and skills.
You could consider a strong LLM like GPT-4 or Claude 3 a polymath that works tirelessly without demanding compensation (beyond subscription or API fees), providing competent assistance in subjects like mathematics and statistics, macroeconomics, biology, and law (GPT-4, for instance, performs well on the Uniform Bar Exam). As these AI models become more proficient and easily accessible, they are likely to play a significant role in shaping the future of work and learning.
By making knowledge more accessible and adaptable, these models have the potential to level the playing field and create new opportunities for people from all walks of life. These models have shown potential in areas that require high levels of reasoning and understanding, although progress varies depending on the complexity of the tasks involved.
Generative image models, for their part, have pushed the boundaries of what is possible in creating visual content, and have improved performance in computer vision tasks such as object detection, segmentation, captioning, and much more.
Let’s clear up the terminology a bit and explain in more detail what is meant by generative models, artificial intelligence, deep learning, and machine learning.
What are generative models?
In popular media, the term artificial intelligence is used a lot when referring to these new models. In theoretical and applied research circles, it is often joked that AI is just a fancy word for ML, or AI is ML in a suit, as illustrated in this image:
Figure 1.2: ML in a suit. Generated by a model on replicate.com, Diffusers Stable Diffusion v2.1
It’s worth distinguishing more clearly between the terms generative model, artificial intelligence, machine learning, deep learning, and language model:
- Artificial Intelligence (AI) is a broad field of computer science focused on creating intelligent agents that can reason, learn, and act autonomously.
- Machine Learning (ML) is a subset of AI focused on developing algorithms that can learn from data.
- Deep Learning (DL) uses deep neural networks, which have many layers, as a mechanism for ML algorithms to learn complex patterns from data.
- Generative Models are a type of ML model that can generate new data based on patterns learned from input data.
- Language Models (LMs) are statistical models used to predict words in a sequence of natural language. Some language models utilize deep learning and are trained on massive datasets, becoming LLMs.
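The idea of a language model as a statistical predictor of the next word can be sketched with a tiny bigram model. This is a deliberately simplified illustration on a made-up toy corpus: it estimates probabilities from raw counts, whereas LLMs learn them with deep neural networks over far longer contexts. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Relative-frequency estimate of P(next word | previous word)."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    if total == 0:
        return {}  # the word was never seen with a successor
    return {word: n / total for word, n in following.items()}


# Toy corpus, invented for illustration.
corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the dog barked",
]
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)
```

In this toy corpus, "the" is followed by "dog" twice, "cat" twice, and "mouse" once, so the model assigns them probabilities of 0.4, 0.4, and 0.2.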
The following class diagram illustrates how LLMs combine deep learning techniques like neural networks with sequence modeling objectives from language modeling at a very large scale:
Figure 1.3: Class diagram of different models. LLMs represent the intersection of deep learning techniques with language modeling objectives
Generative models are a powerful type of AI that can generate new data resembling the training data. They have come a long way, enabling the generation of new examples from scratch using patterns learned from data. Their key distinction is that they synthesize new data rather than just making predictions or decisions. These models can handle different data modalities and are employed across various domains, including text, image, music, and video.
Generative models can facilitate the creation of synthetic data to train AI models when real data is scarce or restricted. This type of data generation reduces labeling costs and improves training efficiency. Microsoft Research took this approach (Textbooks Are All You Need, June 2023) when training their phi-1 model; they used GPT-3.5 to create synthetic Python textbooks and exercises.
The rapid progress across diverse domains shows the potential of generative AI. Within the industry, there is a growing sense of excitement around AI’s capabilities and its potential impact on business operations. But there are key challenges such as data availability, compute requirements, bias in data, evaluation difficulties, potential misuse, and other societal impacts that need to be addressed going forward, which we’ll discuss in Chapter 10, The Future of Generative Models.
Generative AI is extensively used to generate 3D images, avatars, videos, graphs, and illustrations for virtual or augmented reality, video games, graphic design, logo creation, and image editing or enhancement. The most popular model category here is text-conditioned image synthesis, specifically text-to-image generation. As mentioned, in this book, we’ll focus on LLMs, since they have the broadest practical application, but we’ll also have a look at image models, which can sometimes be quite useful.
Let’s delve a bit more into this progress and ask: why is it happening now, and what conditions have made this advancement possible?
Why now?
The success of generative AI is due to several factors, including:
- Improved algorithms
- Considerable advances in computer power and hardware design
- The availability of large, labeled datasets
- An active and collaborative research community
Additionally, the development of more sophisticated mathematical and computational methods has played a vital role in advancing generative models. An example is the backpropagation algorithm, which was popularized in the 1980s and provides a way to effectively train multi-layer neural networks.
In the 2000s, neural networks began to regain popularity as researchers developed more complex architectures. However, it was the advent of deep learning, which relies on neural networks with numerous layers, that marked a significant turning point in the performance and capabilities of these models.
Although the concept of deep learning has existed for some time, the development and expansion of generative models correlate with significant advances in hardware, particularly Graphics Processing Units (GPUs), which have been instrumental in the development of deeper models. This is because deep learning models require a lot of computing resources to train and run, which concerns processing power, memory, and disk space alike.
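A rough back-of-the-envelope calculation shows why memory matters so much. The sketch below estimates the space needed just to store a model's weights, using the parameter counts quoted earlier in this chapter for GPT-2 and GPT-3. It assumes 2 bytes per parameter (16-bit precision), which is an assumption for illustration. Training needs considerably more memory for gradients, optimizer state, and activations, none of which is counted here.

```python
def weights_memory_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Memory needed to store the weights alone, in gigabytes (10**9 bytes).

    bytes_per_parameter=2 assumes 16-bit precision (an assumption for
    this sketch); use 4 for 32-bit precision.
    """
    return num_parameters * bytes_per_parameter / 1e9


# Parameter counts as quoted in this chapter.
for name, params in [("GPT-2", 1.5e9), ("GPT-3", 175e9)]:
    print(f"{name}: ~{weights_memory_gb(params):.0f} GB of weights at 16-bit precision")
```

Under these assumptions, GPT-2's weights take about 3 GB, while GPT-3's take about 350 GB. That is more than a single GPU typically holds, which is one reason such models are spread across many devices.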
The capabilities of LLMs changed dramatically once they became bigger. The more parameters a model has, the higher its capacity to capture knowledge and the relationships between words and phrases. As a simple example of these higher-order correlations, an LLM could learn that the word “cat” is more likely to be followed by the word “dog” if it is preceded by the word “chase,” even if there are other words in between. Generally, the lower a model’s perplexity, a measure of how uncertain it is when predicting the next token, the better it will perform, for example, at answering questions.
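Perplexity can be computed as the exponential of the average negative log-probability that a model assigns to each token of a text. The following is a minimal sketch of that standard definition. The probability values are made up for illustration, and the function name is hypothetical.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability per token).

    token_probs holds the probability the model assigned to each
    token that actually occurred in the text.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Invented probabilities: a confident model vs. an uncertain one.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
uncertain = perplexity([0.1, 0.1, 0.1, 0.1])
print(confident, uncertain)
```

A model that assigns probability 0.5 to every observed token has a perplexity of about 2, as if it were choosing between two equally likely options at each step. A model that assigns 0.1 has a perplexity of about 10.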
In particular, it seems that in models with between 2 and 7 billion parameters, new capabilities emerge, such as the ability to generate creative text in different formats like poems, code, scripts, musical pieces, emails, and letters, and to answer even open-ended and challenging questions in an informative way.