Comparing RAG with model fine-tuning
LLMs can be adapted to your data in two ways:
- Fine-tuning: With fine-tuning, you adjust the weights and/or biases that define the model’s intelligence based on new training data. This changes the model directly, permanently altering how it will respond to new inputs.
- Input/prompts: Here, you leave the model unchanged and instead use the prompt/input to introduce new knowledge that the LLM can act upon (see the sketch after this list).
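To make the second approach concrete, here is a minimal sketch of introducing new knowledge through the prompt alone, using the openai Python package. The model name and the policy fact are purely illustrative, and the package must be installed with an OPENAI_API_KEY environment variable set:

```python
# The input/prompt approach: the model's weights are untouched;
# new knowledge arrives inside the prompt itself.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A fact the model could not have seen during training (illustrative)
new_knowledge = (
    "Our return policy changed on 2024-01-15: customers now have "
    "60 days to return any item."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system",
         "content": "Answer using only the provided context."},
        {"role": "user",
         "content": f"Context:\n{new_knowledge}\n\n"
                    "Question: How long do customers have to return an item?"},
    ],
)
print(response.choices[0].message.content)
```

The model answers from the context supplied at inference time. This is the mechanism RAG builds on, with retrieval filling in that context automatically.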
Why not use fine-tuning in all situations? Once you have introduced the new knowledge, the LLM will always have it! It is also how the model was created – by being trained with data, right? That sounds right in theory, but in practice, fine-tuning has proven more reliable for teaching a model specialized tasks (such as how to converse in a certain style) and less reliable for factual recall.
The reason is complicated, but in general, a model’s knowledge of facts is like a human’s long-term memory. If you memorize a long passage from a speech or book and then try to recall it a few months later, you will likely still understand the context of the information, but you may forget specific details. On the other hand, adding knowledge through the model’s input is like our short-term memory, where the facts, details, and even the order of the wording are all fresh and available for recall. It is this latter scenario that lends itself better to situations where you need reliable factual recall. Given how much more expensive fine-tuning can be, this makes RAG that much more compelling.
There is a trade-off, though. While there are generally ways to feed all the data you have to a model for fine-tuning, inputs are limited by the context window of the model. This is an area that is being actively addressed. For example, early versions of GPT-3.5 had a 4,096-token context window, which is the equivalent of about five pages of text. When GPT-4 was released, the context window was expanded to 8,192 tokens (about 10 pages), and a GPT-4-32k variant offered a context window of 32,768 tokens (about 40 pages). This issue is so important that OpenAI included the context window size in the name of the model – a strong indicator of how important the context window is!
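If you want to check how much of a context window your own text consumes, OpenAI’s tiktoken library exposes the tokenizers its models use. A small sketch (the 8,192-token limit shown matches the original GPT-4 window):

```python
# Counting tokens before sending text to a model with a fixed context window.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # the BPE encoding GPT-4 uses

prompt = "Retrieval-augmented generation feeds external documents to an LLM."
token_count = len(enc.encode(prompt))

print(f"{token_count} tokens")
print(f"Fits in an 8,192-token window: {token_count <= 8192}")
```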
Interesting fact!
What about the latest Gemini 1.5 model? It has a 1-million-token context window – the equivalent of over 1,000 pages!
As context windows expand, another issue arises. Early models with expanded context windows were shown to lose a lot of detail, especially in the middle of the text – a phenomenon often called lost in the middle. This issue is also being addressed. The Gemini 1.5 model, with its 1-million-token context window, has performed well in so-called needle-in-a-haystack tests, which measure whether a model can recall a specific fact placed anywhere in a long input (a simplified sketch of such a test follows). Unfortunately, the model did not perform as well in the multiple-needles-in-a-haystack variant. Expect more effort in this area as context windows become larger, and keep this in mind if you need to work with large amounts of text at a time.
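Here is a simplified sketch of a single-needle test. The filler text, the needle, and the ask_llm function are all hypothetical stand-ins; a real evaluation would call an actual model and use far more varied filler:

```python
# A simplified needle-in-a-haystack test: bury one fact (the needle) at a
# chosen depth in filler text, then check whether the model can retrieve it.

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to your LLM of choice.
    raise NotImplementedError

FILLER = "The sky was clear and the market was quiet that day. " * 2000
NEEDLE = "The secret ingredient in Grandma's pasta sauce is smoked paprika."

def build_haystack(depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + NEEDLE + " " + FILLER[cut:]

for depth in (0.0, 0.5, 1.0):  # start, middle, and end of the context
    prompt = (build_haystack(depth)
              + "\nWhat is the secret ingredient in Grandma's pasta sauce?")
    answer = ask_llm(prompt)
    print(f"depth {depth}: recalled = {'smoked paprika' in answer.lower()}")
```

The multiple-needles variant inserts several distinct facts at different depths and checks that all of them are recalled, which is exactly where long-context models still tend to struggle.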
Note
It is important to note that token count differs from word count, as tokens include punctuation, symbols, numbers, and other text representations. How a compound word such as ice cream is treated token-wise depends on the tokenization scheme and can vary across LLMs. Most well-known LLMs (such as ChatGPT and Gemini) would treat ice cream as two tokens. In some NLP contexts, you could argue it should be one token, on the grounds that a token should represent a useful semantic unit for processing, but that is not how these models tokenize.
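You can see this for yourself with tiktoken. A quick sketch (the exact split depends on the encoding you load, so treat the output as tokenizer-specific):

```python
# Inspecting how a tokenizer actually splits text: word count and
# token count rarely match once punctuation and subwords are involved.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

for text in ("ice cream", "Hello, world!"):
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```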
Fine-tuning can also be quite expensive, depending on the environment and resources you have available. In recent years, the cost of fine-tuning has come down substantially thanks to new techniques such as representation fine-tuning (ReFT), LoRA-related techniques, and quantization. But in many RAG development efforts, fine-tuning is an additional cost on top of an already expensive effort, which makes it that much harder to justify.
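To see why LoRA in particular lowers the cost, here is a brief sketch using Hugging Face’s transformers and peft libraries. It freezes a small base model (gpt2, chosen only for illustration) and trains only low-rank adapter matrices:

```python
# LoRA sketch: the base model's weights stay frozen; only small low-rank
# adapter matrices are trained, which drastically reduces fine-tuning cost.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base model

config = LoraConfig(
    r=8,                        # rank of the adapter matrices
    lora_alpha=32,              # scaling factor for adapter updates
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # trainable params: a tiny fraction of total
```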
Ultimately, when deciding between RAG and fine-tuning, consider your specific use case and requirements. RAG is generally superior for retrieving factual information that is not present in the LLM’s training data or is private. It allows you to dynamically integrate external knowledge without modifying the model’s weights. Fine-tuning, on the other hand, is more suitable for teaching the model specialized tasks or adapting it to a specific domain. Keep context window limitations in mind when relying on prompts, and the potential for overfitting in mind when fine-tuning on a specific dataset.
Now that we have defined what RAG is, particularly when compared to other approaches that use generative AI, let’s review the general architecture of RAG systems.