Choosing an LLM adaptation method
We have covered various LLM adaptation methods, including prompt engineering, domain adaptation pre-training, fine-tuning, and RAG. All of these methods aim to elicit better responses from a pre-trained LLM. With so many options, the natural question is: how do we choose which method to use?
Let’s break down some of the considerations when choosing these different methods.
Response quality
Response quality measures how well the LLM's response aligns with the intent of the user's query. Evaluating response quality can be intricate, as different use cases weigh different considerations, such as knowledge domain affinity, task accuracy, data freshness, source data transparency, and hallucination.
For knowledge domain affinity, domain adaptation pre-training can effectively teach an LLM domain-specific knowledge and terminology. RAG is efficient at retrieving relevant data, but the LLM used for response synthesis may not capture domain-specific patterns, terminology, and nuance as well as fine-tuned or domain-adapted models. If you need strong domain-specific performance, consider domain adaptation pre-training.
If you need to maximize accuracy for specific tasks, then fine-tuning is the recommended approach. Prompt engineering can also improve task accuracy through one-shot or few-shot prompting, but the improvements are prompt-specific and do not generalize to other tasks.
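As an illustration, a few-shot prompt simply prepends labeled examples to the user's query. The sketch below builds such a prompt for a hypothetical sentiment-classification task (the reviews and labels are invented for illustration, and the actual model call is omitted):

```python
def build_few_shot_prompt(query: str) -> str:
    """Build a few-shot prompt by prepending labeled examples to the query."""
    # Hypothetical in-context examples demonstrating the task.
    examples = [
        ("The battery lasts all day.", "positive"),
        ("The screen cracked within a week.", "negative"),
    ]
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for review, label in examples:
        lines += [f"Review: {review}", f"Sentiment: {label}", ""]
    # The final, unlabeled query for the model to complete.
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

print(build_few_shot_prompt("Setup was quick and painless."))
```

Note that this prompt only improves accuracy for the sentiment task it demonstrates; a different task would need its own examples.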
If information freshness in the response is the primary goal, then RAG is the ideal solution since it has access to dynamic external data sources. Prompt engineering can also help with data freshness when up-to-date knowledge is provided as part of the prompt. Fine-tuning and domain adaptation pre-training have knowledge cutoffs based on the latest training dataset used.
For some applications, such as medical diagnosis or financial analysis, knowing how decisions were made and what data sources informed them is crucial. If this is a critical requirement for the use case, then RAG is the clear choice, as it can provide references to the knowledge it used when constructing the response. Fine-tuning and domain adaptation pre-training behave more like a "black box," obscuring which data sources contributed to a given answer.
As mentioned in the previous chapter, LLMs sometimes hallucinate: when they encounter unfamiliar queries, they generate plausible but false information that is not grounded in their training data or user input. Fine-tuning can reduce fabrication by focusing the model on domain-specific knowledge, but the risk remains for unfamiliar inputs. RAG systems address hallucination more directly by anchoring responses to retrieved documents. The initial retrieval step acts as a fact check, finding relevant passages to ground the response in real data, and the subsequent generation is confined to the retrieved context rather than being unconstrained. This mechanism minimizes fabricated responses that are not supported by data.
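The retrieve-then-ground mechanism can be sketched in a few lines. The toy example below scores documents by term overlap with the query (a stand-in for a real embedding-based retriever) and builds a prompt that instructs the model to answer only from the retrieved context; the documents and query are invented for illustration:

```python
def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most terms with the query."""
    q_terms = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_terms & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def grounded_prompt(query: str, docs: list[str]) -> str:
    """Confine generation to retrieved passages to reduce hallucination."""
    context = "\n".join(retrieve(query, docs))
    return (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

docs = ["The warranty covers two years.", "Shipping takes five days."]
print(grounded_prompt("how long is the warranty", docs))
```

A production retriever would use vector similarity over embeddings, but the grounding principle is the same: the model is asked to stay within the retrieved evidence.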
Cost of the adaptation
When evaluating LLM adaptation approaches, it is important to consider both initial implementation costs as well as long-term maintenance costs. With this in mind, let’s compare the costs of the different approaches.
Prompt engineering has the lowest overhead: it involves simply writing and testing prompts to elicit good results from the pre-trained language model. Maintenance may require occasional prompt updates as the foundation model evolves over time.
RAG systems have moderately high startup costs because they require multiple components – embeddings, vector stores, retrievers, and language models. Once built, however, these systems remain relatively static over time, keeping maintenance costs modest.
Full fine-tuning and domain adaptation pre-training can be expensive, requiring massive computational resources and time to update potentially all parameters of a large foundation model, in addition to the cost of dataset preparation. Parameter-Efficient Fine-Tuning (PEFT) is cheaper than full fine-tuning or domain adaptation pre-training, but it is still more expensive than RAG because it requires high-quality dataset preparation and training resources.
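Some back-of-the-envelope arithmetic shows why PEFT is cheaper than full fine-tuning: a LoRA-style adapter of rank r adds only two small factor matrices per adapted weight matrix, so the trainable parameter count is a tiny fraction of the model. The figures below (a 7B-parameter model with 32 layers, hidden dimension 4096, and four adapted projection matrices per layer) are illustrative assumptions, not measurements of any specific model:

```python
def lora_params(d_model: int, n_layers: int, rank: int,
                mats_per_layer: int = 4) -> int:
    """Trainable parameters added by rank-r LoRA adapters.

    Each adapted d_model x d_model weight matrix gains two low-rank
    factors: A (d_model x rank) and B (rank x d_model).
    """
    return n_layers * mats_per_layer * 2 * d_model * rank

trainable = lora_params(d_model=4096, n_layers=32, rank=8)
total = 7_000_000_000  # assumed full model size
print(f"{trainable:,} trainable params "
      f"({trainable / total:.4%} of the full model)")
```

Under these assumptions only about 8.4 million parameters (roughly 0.12% of the model) are trained, which is why PEFT fits on far smaller hardware than full fine-tuning, even though dataset preparation costs remain.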
Implementation complexity
The implementation complexity varies significantly across different techniques, from straightforward to highly advanced configurations.
Prompt engineering has relatively low complexity, requiring mainly language skills and familiarity with few-shot learning to craft prompts that elicit good performance from the foundation model. It demands minimal programming or data-science expertise.
RAG systems have moderate complexity, needing software engineering to build the pipeline components like retrievers and integrators. The complexity rises with advanced RAG configurations and infrastructure, such as complex workflows involving agents and tools, and infrastructure components for monitoring, observability, evaluation, and orchestration.
PEFT and full model fine-tuning have the highest complexity. They require deep expertise in deep learning, NLP, and data science to select training data, write tuning scripts, choose hyperparameters such as the learning rate and loss function, and ultimately update the model's internal representations.