Understanding LLMs
LLMs are deep neural networks adept at understanding and generating human language. They have practical applications in fields like content creation and, more broadly, NLP, whose long-standing goal is to build algorithms capable of understanding and producing natural language text.
The current generation of LLMs, such as GPT-4, are deep neural network architectures that utilize the transformer model and undergo pre-training using unsupervised learning on extensive text data, enabling them to learn language patterns and structures. These architectures have evolved rapidly into versatile foundation models that are suitable for a wide range of downstream tasks and modalities, ultimately driving innovation across various applications and industries.
The notable strength of the latest generation of LLMs as conversational interfaces (chatbots) lies in their ability to generate coherent and contextually appropriate responses, even in open-ended conversations. By generating the next word based on the preceding words repeatedly, the model produces fluent and coherent text that is often indistinguishable from text produced by humans.
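To make this concrete, here is a minimal sketch of that next-token loop using the Hugging Face transformers library. GPT-2 is used purely as a small, freely available stand-in for larger models, and greedy decoding is just one of several possible strategies:

```python
# A minimal sketch of autoregressive (next-token) generation with greedy decoding.
# The model name ("gpt2") is an illustrative choice; any causal LM works similarly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The weather today is", return_tensors="pt").input_ids

for _ in range(20):                                # generate 20 tokens, one at a time
    logits = model(input_ids).logits               # scores for every vocabulary token
    next_id = logits[0, -1].argmax()               # greedy: pick the most probable token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

In practice, the greedy argmax is usually replaced by temperature, top-k, or nucleus sampling to produce more varied text.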
At its core, language modeling, and more broadly NLP, relies heavily on the quality of representation learning. A generative language model encodes information about the text it has been trained on and generates new text based on what it has learned.
Representation learning is about a model learning its internal representations of raw data to perform a machine learning task, rather than relying only on engineered feature extraction. For example, an image classification model based on representation learning might learn to represent images according to visual features like edges, shapes, and textures. The model isn’t told explicitly what features to look for – it learns representations of the raw pixel data that help it make predictions.
Recently, LLMs have been used in tasks like copywriting, code development, translation, and understanding genetic sequences. More broadly, applications of language models involve multiple areas, such as:
- Question answering: AI chatbots and virtual assistants can provide personalized and efficient assistance, reducing response times in customer support and thereby enhancing customer experience. These systems can be used in specific contexts like restaurant reservations and ticket booking.
- Automatic summarization: Language models can create concise summaries of articles, research papers, and other content, enabling users to consume and understand information rapidly.
- Sentiment analysis: By analyzing opinions and emotions in texts, language models can help businesses understand customer feedback and opinions more efficiently.
- Topic modeling: LLMs can discover abstract topics and themes across a corpus of documents. They identify word clusters and latent semantic structures.
- Semantic search: LLMs can focus on understanding meaning within individual documents. They use NLP to interpret words and concepts for improved search relevance.
- Machine translation: Language models can translate texts from one language into another, supporting businesses in their global expansion efforts. New generative models can perform on par with commercial products (for example, Google Translate).
Despite their remarkable achievements, language models still face limitations when dealing with complex mathematical or logical reasoning tasks. It remains uncertain whether continually increasing the scale of language models will inevitably lead to new reasoning capabilities. Further, LLMs are known to return the most probable answers within the context, which can sometimes yield fabricated information, called hallucinations. This is a feature as well as a bug since it highlights their creative potential.
We’ll talk about hallucinations in Chapter 5, Building a Chatbot Like ChatGPT, but for now, let’s discuss the nitty-gritty details – how do these LLMs work under the hood?
How do GPT models work?
A new deep learning architecture called the Transformer emerged in 2017, introduced by researchers at Google and the University of Toronto in the paper Attention Is All You Need (Vaswani et al.). It uses self-attention, allowing the model to focus on the important parts of a sentence and understand how words relate to each other.
In 2018, researchers took transformers to the next level by creating Generative Pre-trained Transformers (GPTs) (in Improving Language Understanding by Generative Pre-Training; Radford et al.). These models are trained by predicting the next word in a sequence, like a massive guessing game that helps them grasp language patterns. After this pre-training process, GPTs can be further refined for specific tasks like translation or sentiment analysis. This combines unsupervised learning (pre-training) and supervised learning (fine-tuning) for better performance across various tasks. It also reduces the difficulty of training LLMs.
Transformers
Models based on transformers outperformed previous approaches, such as recurrent neural networks, particularly Long Short-Term Memory (LSTM) networks. Recurrent networks like LSTMs have a limited memory: information from early in a sequence tends to fade as processing continues, which is problematic for long sentences or complex ideas where earlier information is still relevant.
Transformers work differently: they take advantage of the full context, and they can keep learning and refining their understanding as they process more words in a sentence. This ability to leverage the entire context throughout the sentence leads to better performance for tasks like translation, summarization, and question-answering. The model can capture the nuances of longer sentences and complex relationships between words. In essence, a key reason for the success of transformers has been their ability to maintain performance across long sequences better than other models, for example, recurrent neural networks.
The transformer model architecture has an encoder-decoder structure, where the encoder maps an input sequence to a sequence of hidden states, and the decoder maps the hidden states to an output sequence. The hidden state representations consider not only the inherent meaning of the words (their semantic value) but also their context in the sequence.
The encoder is made up of identical layers, each with two sub-layers: the first applies a self-attention mechanism to the input embeddings, and the second is a fully connected feed-forward network. Each sub-layer is followed by a residual connection and layer normalization: the output of each sub-layer is the sum of its input and its output, which is then normalized.
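To make the sub-layer structure concrete, here is a compact PyTorch sketch of one (post-norm) encoder layer; the dimensions and activation are illustrative choices, not those of any specific model:

```python
# A compact sketch of a single transformer encoder layer (post-norm variant),
# mirroring the sub-layer structure described above. Hyperparameters are illustrative.
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # self-attention: queries, keys, values are all x
        x = self.norm1(x + attn_out)          # residual connection, then layer normalization
        ff_out = self.ff(x)                   # position-wise feed-forward network
        return self.norm2(x + ff_out)         # second residual connection and normalization
```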
The decoder uses this encoded information to generate the output sequence one item at a time, using the context of the previously generated items. It also has identical modules, with the same two sub-layers as the encoder. In addition, the decoder has a third sub-layer that performs Multi-Head Attention (MHA) over the output of the encoder stack. The decoder also uses residual connections and layer normalization. The self-attention sub-layer in the decoder is modified to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can only depend on the known outputs at positions less than i. These are indicated in the diagram here (source: Yuening Jia, Wikimedia Commons):
Figure 1.4: The Transformer architecture
The architectural features that have contributed to the success of transformers are:
- Positional encoding: Since the transformer doesn’t process words sequentially but instead processes all words simultaneously, it lacks any notion of the order of words. To remedy this, information about the position of words in the sequence is injected into the model using positional encodings. These encodings are added to the input embeddings representing each word, thus allowing the model to consider the order of words in a sequence (a minimal sketch of such encodings follows this list).
- Layer normalization: To stabilize the network’s learning, the transformer uses a technique called layer normalization. This technique normalizes the model’s inputs across the features dimension (instead of the batch dimension as in batch normalization), thus improving the overall speed and stability of learning.
- MHA: Instead of applying attention once, the transformer applies it multiple times in parallel, improving the model’s ability to focus on different types of information and thus capturing a richer combination of features.
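As a minimal illustration of the positional encodings mentioned in the first bullet, the following sketch computes the sinusoidal encodings used in the original transformer paper; the sequence length and model dimension are arbitrary illustrative values:

```python
# A minimal sketch of sinusoidal positional encodings; max_len and d_model are illustrative.
import numpy as np

def positional_encoding(max_len: int = 128, d_model: int = 512) -> np.ndarray:
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(d_model)[None, :]                       # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dimensions: cosine
    return pe                                                # added to the token embeddings

print(positional_encoding().shape)  # (128, 512)
```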
The basic idea behind attention mechanisms is to compute a weighted sum of the value (or content) vectors associated with each position in the input sequence, based on the similarity between the current position and all other positions. This weighted sum, known as the context vector, is then used as an input to the subsequent layers of the model, enabling the model to selectively attend to relevant parts of the input during the decoding process.
To enhance the expressiveness of the attention mechanism, it is often extended to include multiple so-called heads, where each head has its own set of query, key, and value vectors, allowing the model to capture various aspects of the input representation. The individual context vectors from each head are then concatenated or combined in some way to form the final output.
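The following NumPy sketch shows this computation for a single attention head: similarities between queries and keys are turned into softmax weights, which then produce a weighted sum of the values (the context vectors). In multi-head attention, this same computation runs once per head with separate learned projections, and the per-head outputs are concatenated. The shapes and random inputs are purely illustrative:

```python
# A minimal sketch of scaled dot-product attention: each position's context vector
# is a weighted sum of the value vectors, with weights from query-key similarity.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax over the sequence
    return weights @ V                                        # context vectors

seq_len, d_k = 5, 64
Q, K, V = (np.random.randn(seq_len, d_k) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 64)
```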
Early attention mechanisms scaled quadratically with the length of the sequences (context size), rendering them inapplicable to settings with long sequences. Different mechanisms have been tried out to alleviate this. Many LLMs use some form of Multi-Query Attention (MQA), including OpenAI’s GPT-series models, Falcon, SantaCoder, and StarCoder.
MQA is a variant of MHA in which all query heads share a single key and value head instead of each head having its own. By removing the heads dimension from the key and value projections and optimizing memory usage, MQA improves the efficiency of language models for various language tasks: it allows for 11 times better throughput and 30% lower latency in inference tasks compared to baseline models without MQA.
Llama 2 and a few other models use Grouped-Query Attention (GQA). In autoregressive decoding, the key (K) and value (V) pairs for the previous tokens in the sequence are cached so they don’t have to be recomputed, which speeds up attention computation. However, as the context window or batch size increases, the memory cost of this KV cache in MHA models also increases significantly. To address this, GQA shares the key and value projections across groups of query heads, shrinking the cache without much degradation of performance.
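A rough way to see why MQA and GQA matter is to compare KV cache sizes. The sketch below uses a configuration loosely modeled on a Llama-2-70B-like setup (80 layers, head dimension 128, 16-bit cache entries); the exact numbers are illustrative, but the ratio between the variants is the point:

```python
# An illustrative comparison of KV cache sizes for MHA, GQA, and MQA.
# The configuration is loosely Llama-2-70B-like; numbers are for illustration only.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_entry=2):
    # 2x for keys and values, stored per layer, per token, per KV head
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_entry

cfg = dict(n_layers=80, head_dim=128, seq_len=4096, batch=1)
print("MHA (64 KV heads):", kv_cache_bytes(n_kv_heads=64, **cfg) / 1e9, "GB")
print("GQA  (8 KV heads):", kv_cache_bytes(n_kv_heads=8, **cfg) / 1e9, "GB")
print("MQA   (1 KV head):", kv_cache_bytes(n_kv_heads=1, **cfg) / 1e9, "GB")
```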
There have been many other proposed approaches to obtain efficiency gains, such as sparse, low-rank self-attention, and latent bottlenecks, to name just a few. Other work has tried to extend sequences beyond the fixed input size; architectures such as Transformer-XL reintroduce recursion by storing hidden states of already encoded sentences to leverage them in the subsequent encoding of the next sentences.
The combination of these architectural features allows GPT models to successfully tackle tasks that involve understanding and generating text in human language and other domains. The overwhelming majority of LLMs are transformers, as are many other state-of-the-art models we will encounter in the different sections of this chapter, including models for image, sound, and 3D objects.
As the name suggests, a particularity of GPTs lies in pre-training. Let’s see how these LLMs are trained!
Pre-training
The transformer is trained in two phases using a combination of unsupervised pre-training and discriminative task-specific fine-tuning. The goal during pre-training is to learn a general-purpose representation that transfers to a wide range of tasks.
The unsupervised pre-training can follow different objectives. In Masked Language Modeling (MLM), introduced in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Devlin and others (2019), a portion of the input tokens is masked out, and the model attempts to predict the missing tokens based on the context provided by the non-masked portion. For example, if the input sentence is “The cat [MASK] over the wall,” the model would ideally learn to predict “jumped” for the mask.
In this case, the training objective minimizes the differences between predictions and the masked tokens according to a loss function. Parameters in the models are then iteratively updated according to these comparisons.
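You can try this objective directly with an off-the-shelf masked language model; the sketch below uses Hugging Face’s fill-mask pipeline with bert-base-uncased, which is simply a convenient, freely available checkpoint:

```python
# A quick sketch of masked language modeling in action, using an off-the-shelf
# BERT checkpoint; the model name is just an illustrative choice.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The cat [MASK] over the wall."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Plausible completions such as "jumped", "climbed", or "leaped" should rank highly.
```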
Negative Log-Likelihood (NLL) and Perplexity (PPL) are important metrics used in training and evaluating language models. NLL is a loss function used in ML algorithms; minimizing it is equivalent to maximizing the probability the model assigns to the correct predictions. A low NLL indicates that the network has successfully learned patterns from the training set, so it will accurately predict the labels of the training samples. It’s important to mention that NLL is a non-negative value.
PPL, on the other hand, is the exponentiation of the average per-token NLL, providing a more intuitive way to understand the model’s performance. Small PPL values indicate a well-trained network that can predict accurately, while high values indicate poor learning performance. Intuitively, we could say that a low PPL means that the model is not surprised by the next word. Therefore, the goal in pre-training is to minimize PPL, which means the model’s predictions align more with the actual outcomes.
In comparing different language models, PPL is often used as a benchmark metric across various tasks. It gives us an idea of how well the language model is performing in that a lower PPL indicates the model is more certain of its predictions. Hence, a model with low PPL would be considered better performing than a model with high PPL.
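The relationship between the two metrics is easy to see numerically. In the sketch below, the per-token probabilities are made-up values standing in for what a model might assign to each correct next token:

```python
# A tiny numerical sketch of the NLL/PPL relationship with made-up probabilities.
import math

token_probs = [0.6, 0.1, 0.9, 0.3]                      # P(correct token | context), illustrative
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(nll)                                      # perplexity is the exponentiated mean NLL
print(f"NLL: {nll:.3f}, PPL: {ppl:.3f}")                 # lower is better for both
```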
The first step in training an LLM is tokenization. This process involves building a vocabulary, which maps tokens to unique numerical representations so that they can be processed by the model, given that LLMs are mathematical functions that require numerical inputs and outputs.
Tokenization
Tokenizing a text means splitting it into tokens (words or subwords), which are then converted to IDs through a look-up table that maps each token to a unique integer.
Before training the LLM, the tokenizer – more precisely, its dictionary – is typically fitted to the entire training dataset and then frozen. It’s important to note that tokenizers do not produce arbitrary integers. Instead, they output integers within a specific range – from 0 to N, where N represents the vocabulary size of the tokenizer.
Definitions
- A token is an instance of a sequence of characters, typically forming a word, punctuation mark, or number. Tokens serve as the base elements for constructing sequences of text.
- Tokenization refers to the process of splitting text into tokens. A simple tokenizer might split on whitespace and punctuation to break text into individual tokens.
Examples
Consider the following text:
“The quick brown fox jumps over the lazy dog!”
This would get split into the following tokens:
[“The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “!”]
Each word is an individual token, as is the punctuation mark.
There are a lot of tokenizers that work according to different principles, but common types of tokenizers employed in models are Byte-Pair Encoding (BPE), WordPiece, and SentencePiece. For example, Llama 2’s BPE tokenizer splits numbers into individual digits and uses bytes to decompose unknown UTF-8 characters. The total vocabulary size is 32,000 tokens.
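The following sketch shows BPE tokenization in practice. GPT-2’s tokenizer is used here only because it is small and freely downloadable; Llama 2’s tokenizer behaves analogously but requires accepting a license:

```python
# A short sketch of BPE tokenization; GPT-2's tokenizer is used purely as an
# accessible example of a subword tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "The quick brown fox jumps over the lazy dog!"
ids = tokenizer.encode(text)
print(ids)                                   # token IDs in the range [0, vocab_size)
print(tokenizer.convert_ids_to_tokens(ids))  # subword tokens, not necessarily whole words
print(tokenizer.vocab_size)                  # 50257 for GPT-2
```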
It is necessary to point out that LLMs can only generate outputs based on a sequence of tokens that does not exceed its context window. This context window refers to the length of the longest sequence of tokens that an LLM can use. Typical context window sizes for LLMs can range from about 1,000 to 10,000 tokens.
After pre-training, a major step is how models are prepared for specific tasks either by fine-tuning or prompting. Let’s see what this task conditioning is about!
Conditioning
Conditioning LLMs refers to adapting the model for specific tasks. It includes fine-tuning and prompting:
- Fine-tuning involves modifying a pre-trained language model by training it on a specific task using supervised learning. For example, to make a model more amenable to chats with humans, the model is trained on examples of tasks formulated as natural language instructions (instruction tuning). Beyond such supervised fine-tuning, pre-trained models are often trained further with Reinforcement Learning from Human Feedback (RLHF) to be helpful and harmless.
- Prompting techniques present problems in text form to generative models. There are a lot of different prompting techniques, from simple questions to detailed instructions. Prompts can include examples of similar problems and their solutions. Zero-shot prompting involves no examples, while few-shot prompting includes a small number of examples of relevant problem and solution pairs.
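As a small illustration of the difference, the prompts below show a zero-shot and a few-shot variant of the same sentiment classification task; the wording and examples are invented for illustration and could be sent to any completion or chat model:

```python
# Illustrative zero-shot versus few-shot prompts for the same task.
zero_shot = (
    "Classify the sentiment of this review as positive or negative:\n"
    "'Great battery life, terrible screen.'"
)

few_shot = """Classify the sentiment of each review as positive or negative.

Review: 'I loved every minute of it.'
Sentiment: positive

Review: 'Broke after two days.'
Sentiment: negative

Review: 'Great battery life, terrible screen.'
Sentiment:"""

print(zero_shot)
print(few_shot)
```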
These conditioning methods continue to evolve, becoming more effective and useful for a wide range of applications. Prompt engineering and fine-tuning methods will be explored further in Chapter 8, Customizing LLMs and Their Output.
How have GPT models evolved?
The development of GPT models has seen considerable progress, with OpenAI’s GPT-n series leading the way in creating foundational AI models. A major driver has been the size of models in terms of their parameters; however, other drivers play a role as well.
A foundation model (sometimes known as a base model) is a large model that was trained on an immense quantity of data at scale so that the model can be adapted to a wide range of downstream tasks. In GPT models, this pre-training is done via self-supervised learning.
There has been a recent shift in focus towards exploring alternative approaches to improve model performance on benchmarks like MMLU, beyond simply scaling up the model size. A critical area of focus has been the curation and quality of the training data. Carefully selecting and filtering the training data to ensure its relevance, diversity, and quality can significantly impact the model’s performance, especially on benchmarks that test for a broad range of knowledge and reasoning abilities.
Another key area of innovation has been in model architectures. For example, the Mixtral and Leeroo models employ a mixture-of-experts approach, where different subsets of the model’s parameters are specialized for different tasks, potentially improving performance and computational efficiency.
By exploring these alternative approaches in conjunction with continued scaling efforts, the field is striving to develop language models with even more robust language understanding and reasoning abilities across diverse domains.
The computational requirements and the cost of the model training have been enormous and will probably increase in the future. The computational cost of LLMs is enough to make your wallet weep. But fear not! Before we explore ways to lighten the load, let’s explore what makes these models so weighty in the first place: their size!
Model size
The size of the training corpus for LLMs has been increasing drastically. GPT-1, introduced by OpenAI in 2018, was trained on BookCorpus, which has 985 million words. BERT, released in the same year, was trained on a combined corpus of BookCorpus and English Wikipedia, totaling 3.3 billion words. Now, training corpora for LLMs have up to trillions of tokens.
OpenAI has been coy about the technical details of their models; however, information has been circulating that, with about 1.8 trillion parameters, GPT-4 is more than 10x the size of GPT-3. Further, OpenAI was able to keep costs reasonable by utilizing a Mixture of Experts (MoE) model consisting of 16 experts within their model, each having about 111 billion parameters.
Apparently, GPT-4 was trained on about 13 trillion tokens. However, these are not unique tokens since they count repeated presentation of the data in each epoch. Training was conducted for two epochs for text-based data and four for code-based data. For fine-tuning, the dataset consisted of millions of rows of instruction fine-tuning data. Another rumor, again to be taken with a pinch of salt, is that OpenAI might be applying speculative decoding on GPT-4’s inference, with the idea that a smaller model (oracle model) could be predicting the large model’s responses, and these predicted responses could help speed up decoding by feeding them into the larger model, thereby skipping tokens. This is a risky strategy because – depending on the threshold of the confidence of the oracle’s responses – the quality could deteriorate.
The increase in the scale of language models has been a major driving force behind their impressive performance gains, with models like Google’s Gemini continuing to push the boundaries of size and capability. This graph illustrates how LLMs have been growing:
Figure 1.5: LLMs from BERT to GPT-4 – size (number of parameters), and licenses. For proprietary models, parameter sizes are often estimates
In examining the historical progression depicted in the graph, it is evident that LLMs have consistently increased in size, as indicated by the growing number of parameters. This trend aligns with a broader pattern observed in machine learning, where enhancing model performance often involves expanding model size. A paper from 2020 from OpenAI by Kaplan et al. (Scaling laws for neural language models, 2020) discussed scaling laws and the choice of parameters.
They identified a power-law relationship between performance and both dataset size and model size: test loss falls off predictably as either is increased, so each further gain in performance requires a multiplicative increase in data or parameters. For optimal results, both elements should be scaled in tandem, thus preventing potential bottlenecks in model training and performance.
In addition to dataset and model size, it is essential to consider the training budget, which significantly influences the training process’s efficiency and outcomes. The training budget encompasses factors such as computational power and time allocated for model training. This metric serves as an alternative to measuring training in terms of epochs, allowing more flexibility and precision in determining the optimal point to cease training. Given the complexity and extensive training requirements of LLMs, it can be challenging to pinpoint the precise convergence point. Thus, the training budget plays a crucial role in efficiently managing resources while striving for the highest model performance.
Researchers at DeepMind (An empirical analysis of compute-optimal large language model training; Hoffmann et al., 2022) analyzed the training compute and dataset size of LLMs and concluded that LLMs are undertrained in terms of compute budget and dataset size as suggested by scaling laws. They predicted that large models would perform better if they were substantially smaller and trained for much longer, and – in fact – validated their prediction by comparing a 70-billion-parameter Chinchilla model on a benchmark to their Gopher model, which consists of 280 billion parameters.
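A back-of-the-envelope sketch captures the practical takeaway often drawn from this analysis: roughly 20 training tokens per parameter, with training compute approximated as about 6 FLOPs per parameter per token. Both numbers are commonly cited approximations rather than exact values from the paper:

```python
# A back-of-the-envelope sketch of Chinchilla-style compute-optimal scaling.
# The ~20 tokens/parameter ratio and C ≈ 6*N*D are rough, commonly cited approximations.
def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens           # ~6 FLOPs per parameter per token

n_params = 70e9                              # e.g., a Chinchilla-sized model
n_tokens = compute_optimal_tokens(n_params)  # ~1.4 trillion tokens
print(f"Tokens: {n_tokens:.2e}, FLOPs: {training_flops(n_params, n_tokens):.2e}")
```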
However, more recently, a team at Microsoft Research has challenged these conclusions and surprised everyone (Textbooks Are All You Need; Gunasekar et al., June 2023), finding that small networks trained on high-quality datasets can give very competitive performance – their model phi-1-small only comprises 350 million parameters! We’ll discuss this model again in Chapter 6, Developing Software with Generative AI, and we’ll discuss the implications of scaling in Chapter 10, The Future of Generative Models.
We could see new scaling laws linking performance with data quality, and it will be instructive to observe whether model sizes for LLMs keep increasing at the same rate as they have. This is an important question since it determines if the development of LLMs will be firmly in the hands of large organizations. It could be that there’s a saturation of performance at a certain size, which only changes in the approach can overcome. We haven’t seen this leveling off yet, though.
The GPT model series
Trained on 300 billion tokens, GPT-3 has 175 billion parameters, an unprecedented size for DL models. GPT-4 is the most recent in the series, though its size and training details have not been published due to competitive and safety concerns; estimates of its size vary widely, from a few hundred billion parameters up to the roughly 1.8 trillion mentioned earlier. Sam Altman, the CEO of OpenAI, has stated that the cost of training GPT-4 was more than $100 million.
ChatGPT, launched by OpenAI in November 2022, stands out as a conversational model developed on the foundation of earlier GPT models, notably GPT-3. It is specifically tailored for dialogue, employing a mix of role-playing scenarios by humans and examples to guide the model towards desired behaviors, significantly enhanced by the use of Reinforcement Learning from Human Feedback (RLHF). Instead of learning from a pre-set reward based on task performance, RLHF trains a model using feedback from humans to understand what good (high reward) and bad (low reward) responses look like. RLHF has proven effective in making AI models more aligned with human values and preferences, and it has been applied in areas ranging from conversational agents to computer vision.
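At the heart of RLHF is a reward model trained on human preference comparisons. The sketch below shows the pairwise loss commonly used for this step; the reward scores are placeholder numbers, and this is only one component of the full RLHF pipeline (which also involves policy optimization, for example with PPO):

```python
# A minimal sketch of the pairwise preference loss commonly used to train the
# reward model in RLHF: the reward of the human-preferred response should exceed
# the reward of the rejected one. Scores here are illustrative placeholders.
import torch
import torch.nn.functional as F

reward_chosen = torch.tensor([1.2, 0.3])     # reward model scores for preferred responses
reward_rejected = torch.tensor([0.4, 0.9])   # scores for the responses humans rejected
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss)  # minimizing this pushes chosen rewards above rejected ones
```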
The introduction of GPT-4 in March 2023 marked a further leap in capabilities. GPT-4 provides superior performance on various evaluation tasks, coupled with significantly better response avoidance to malicious or provocative queries due to six months of iterative alignment during training.
The following diagram shows the timeline of the different model iterations:
Figure 1.6: The development of the OpenAI GPT model series
There’s also a multi-modal version of GPT-4 that incorporates a separate vision encoder, trained on joint image and text data, giving the model the capability to read web pages and transcribe what’s in images and video.
As can be seen in Figure 1.5, there are quite a few open-source and closed-source models besides OpenAI’s, some of which come close to OpenAI’s models in performance; we will have a look at these next.
PaLM and Gemini
PaLM 2, released in May 2023, was trained with the aim of improving multilingual and reasoning capabilities while being more compute efficient. Using evaluations at different compute scales, the authors (Anil et al.; PaLM 2 Technical Report) estimated an optimal scaling of training data sizes and parameters. PaLM 2 is smaller than the original PaLM and exhibits fast and efficient inference, allowing broad deployment and fast response times for a natural pace of interaction. Extensive benchmarking across different model sizes has shown that PaLM 2 has significantly improved quality on downstream tasks, including multilingual common sense and mathematical reasoning, coding, and natural language generation, compared to its predecessor PaLM.
PaLM 2 was also tested on various professional language proficiency exams. The exams used were for Chinese (HSK 7-9 Writing and HSK 7-9 Overall), Japanese (J-Test A-C Overall), Italian (PLIDA C2 Writing and PLIDA C2 Overall), French (TCF Overall), and Spanish (DELE C2 Writing and DELE C2 Overall). Across these exams, which were designed to test C2-level proficiency, considered mastery or advanced professional level according to the Common European Framework of Reference for Languages (CEFR), PaLM 2 achieved mostly high-passing grades.
Gemini, released by Google in December 2023, is a family of highly capable multimodal models jointly trained on image, audio, video, and text data. The largest version, Gemini Ultra, sets new state-of-the-art results across 30 benchmarks spanning language, coding, reasoning, and multimodal tasks like MMMU (Massive Multi-discipline Multimodal Understanding). It demonstrates impressive cross-modal reasoning capabilities, understanding, and reasoning across different modalities like text, images, and audio.
Llama and Llama 2
The releases of the Llama and Llama 2 series of models, with up to 70 billion parameters, by Meta AI in February and July 2023, respectively, have been highly influential by enabling the community to build on top of them, thereby kicking off a Cambrian explosion of open-source LLMs. Llama triggered the creation of models such as Vicuna, Koala, RedPajama, MPT, Alpaca, and Gorilla. Llama 2, since its release, has already inspired several very competitive coding models, such as WizardCoder.
Optimized for dialogue use cases, at their release, these LLMs outperformed other open-source chat models on most benchmarks and seem on par with some closed-source models based on human evaluations. The Llama 2 70B model performs on a par with or better than PaLM (540 billion parameters) on almost all benchmarks, but there is still a large performance gap between Llama 2 70B and GPT-4 and PaLM-2-L.
Llama 2 is an updated version of Llama 1 trained on a new mix of publicly available data. The pre-training corpus size has increased by 40% (2 trillion tokens of data), the context length of the model has doubled, and grouped-query attention has been adopted. Variants of Llama 2 with different parameter sizes (7 billion, 13 billion, 34 billion, and 70 billion) have been released. While Llama was released under a non-commercial license, Llama 2 is open to the general public for research and commercial use.
Llama 2-Chat has undergone safety evaluations in comparison with other open-source and closed-source models. Human raters judged model generations for safety violations across approximately 2,000 adversarial prompts, including both single-turn and multi-turn prompts.
Claude 1–3
Claude, Claude 2, and Claude 3 are AI assistants created by Anthropic. Claude 2 improved upon previous versions in areas like helpfulness, honesty, and reduced bias. Key enhancements include a massively increased context window of up to 200,000 tokens and strong performance on coding, summarization, and long document understanding tasks.
The latest release is Claude 3, a new family of large multimodal models, including the flagship Claude 3 Opus (the most capable), Claude 3 Sonnet (balanced skills and speed), and Claude 3 Haiku (the fastest and least expensive). With vision capabilities, they demonstrate strong performance across benchmarks, including MMLU. Notably, Claude 3 Opus surpassed OpenAI’s GPT-4 on the Chatbot Arena leaderboard, while exhibiting improved multilingual fluency.
Mixture of Experts (MoE)
Recently, MoE models have had success with high performance at low usage of resources. Mixtral 8x7B by Mistral AI is a sparse MoE model that outperforms or matches Llama 2 70B and GPT-3.5 across various benchmarks, excelling particularly in math, code generation, and multilingual tasks. Its instruction-tuned version, Mixtral 8x7B-Instruct, surpasses several other prominent models, like GPT-3.5 Turbo and Claude-2.1, on human benchmarks.
Grok-1 is a 314-billion-parameter MoE LLM trained from scratch by xAI and released under the Apache 2.0 license. With 25% of its weights active on a given token, this raw model provides a foundation for further fine-tuning and customization by researchers and developers. xAI trained Grok-1 using their custom training stack built on top of JAX and Rust, showcasing their expertise in developing cutting-edge language models at massive scales. The Leeroo Orchestrator by Leeroo proposes an architecture that integrates multiple LLMs to create a new state-of-the-art model. It achieves performance on par with Mixtral at lower computational cost and even exceeds GPT-4’s accuracy on the MMLU benchmark with further cost reductions.
DBRX is Databricks’ open LLM, which sets a new state of the art among open LLMs across standard benchmarks. It surpasses GPT-3.5 and is competitive with Gemini 1.0 Pro, excelling as a code model by outperforming specialized models like CodeLLaMA-70B. DBRX advances efficiency with a fine-grained MoE architecture, offering up to 2x faster inference than LLaMA2-70B and a 40% smaller parameter count than Grok-1. Hosted on Mosaic AI Model Serving, it can generate text at up to 150 tokens/sec/user. Training is about 2x more computationally efficient than dense models for the same quality, achieving previous MPT model quality with nearly 4x less compute overall.

Other notable models contributing to LLM advancements include DeepMind’s Chinchilla, Meta’s OPT, Google’s Gopher, Hugging Face’s BLOOM, and various models from research groups like EleutherAI’s GPT-NeoX.
How to use LLMs
You can access LLMs from OpenAI, Google, and Anthropic through their websites or their APIs. If you want to try other LLMs on your laptop, open-source LLMs are a good place to get started. There is a whole model zoo out there! You can access these models through Hugging Face or other providers, as we’ll see starting in Chapter 3, Getting Started with LangChain. You can even download these open-source models, fine-tune them, or fully train them. We’ll fine-tune a model in Chapter 8, Customizing LLMs and Their Output.
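As a quick taste of running an open-source model locally, the sketch below uses the Hugging Face pipeline API; the model named here is just one small, publicly available example, and any text-generation checkpoint could be substituted:

```python
# A quick sketch of running an open-source LLM locally via Hugging Face;
# the model name is an illustrative small model, not a recommendation.
from transformers import pipeline

generator = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
output = generator("Explain what a transformer is in one sentence.", max_new_tokens=60)
print(output[0]["generated_text"])
```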
The different licenses for LLMs significantly impact how they can be used, modified, and further developed for commercial or research purposes. Some code for training, training datasets, and the weights themselves have been made available to the community to run locally, poke into for investigations, further develop, fine-tune, and improve upon. Other models have been kept behind APIs and the secrets behind their performance are a matter of rumors and speculation. Here’s a breakdown of some key license types and their implications.
Open source licenses (for example, Apache 2.0, MIT):
- Allow free use, modification, and redistribution for both commercial and non-commercial purposes
- Permit the creation of derivative works and the integration of the models into products/services
- Research institutions and commercial entities can build upon and extend these models
- Examples: BERT, Mistral
Non-commercial licenses (for example, CC-BY-NC-4.0, non-commercial research):
- Permit use and modification only for non-commercial research purposes
- Commercial entities cannot directly use or integrate these models into products/services
- Researchers can study, evaluate, and build upon the models within academic settings
- Examples: Galactica, OPT, Llama 65B
Proprietary licenses:
- Models are closed-source and cannot be freely used, modified, or redistributed
- Commercial entities retain full control and can monetize the models as products/services
- Research institutions may have limited access for evaluation/benchmarking purposes
- Examples: GPT-4, Claude, Gemini
Licenses like the Databricks Open Model License and Llama 2 Community License:
- Allow the use, modification, and creation of derivative works for both commercial and non-commercial purposes
- But may place certain conditions on redistribution, indemnification, or usage tracking
- Strike a balance between open access and commercial interests
In general, open source licenses promote wide adoption, collaboration, and innovation around the models, benefiting both research and commercial development. Proprietary licenses give companies exclusive control but may limit academic research progress. Non-commercial licenses restrict commercial use while enabling research. New licenses aim to mitigate these trade-offs.
In the next section, we’ll be reviewing state-of-the-art methods for text-conditioned image generation. I’ll highlight the progress made in the field so far, but also discuss existing challenges and potential future directions.