Model evaluation
In model evaluation, the objective is to assess the capabilities of a single model, without any additional components such as prompt engineering or a RAG pipeline.
This evaluation is essential for several reasons, such as selecting the most relevant LLM or making sure that the fine-tuning process actually improved the model. In this section, we will compare ML and LLM evaluation to understand the main differences between these two fields. We will then explore benchmarks for general-purpose, domain-specific, and task-specific models.
Comparing ML and LLM evaluation
ML evaluation is centered on assessing the performance of models designed for tasks like prediction, classification, and regression. Unlike the evaluation of LLMs, which often focuses on how well a model understands and generates language, ML evaluation is more concerned with how accurately and efficiently a model can process structured data to produce specific outcomes.
This difference comes from the nature of the tasks these models handle. ML models are generally designed for narrowly defined problems, such as predicting stock prices or detecting outliers, which often involve numerical or categorical data, making the evaluation process more straightforward. On the other hand, LLMs are tasked with interpreting and generating language, which adds a layer of subjectivity to the evaluation process. Instead of relying solely on numerical benchmarks, LLM evaluation requires a more nuanced approach and often incorporates qualitative assessments, examining how well the model produces coherent, relevant, and contextually accurate responses in natural language.
In particular, we can see three key differences in how these models work, which impact the evaluation process:
- Numerical metrics: Evaluating ML models typically involves measuring objective performance metrics, such as accuracy, precision, recall, or mean squared error, depending on the type of task at hand. This is less clear with LLMs, which can handle multiple tasks (hence, multiple evaluations) and can rarely rely on the same numerical metrics.
- Feature engineering: In traditional ML, a critical part of the process involves manually selecting and transforming relevant data features before training the model. Evaluating the success of this feature engineering often becomes part of the broader model evaluation. LLMs, however, are designed to handle raw text data directly, reducing the need for manual feature engineering.
- Interpretability: With ML models, it is easier to interpret why a model made certain predictions or classifications, and this interpretability can be a core part of their evaluation. This direct interpretation is not possible with LLMs. However, requesting explanations during the generation process can give insights into the model’s decision-making process.
In the following section, we will see a more fine-grained exploration of different types of LLMs. While evaluating general-purpose models is fairly disconnected from ML evaluation, task-specific LLMs are more closely aligned with traditional ML.
General-purpose LLM evaluations
General-purpose evaluations refer to metrics dedicated to base and general-purpose fine-tuned models. They cover a breadth of capabilities that are correlated with knowledge and usefulness without focusing on specific tasks or domains. This allows developers to get an overview of these capabilities, compare themselves with competitors, and identify strengths and weaknesses. Based on these results, it is possible to tweak the dataset and hyperparameters, or even modify the architecture.
We can broadly categorize general-purpose evaluations into three phases: during pre-training, after pre-training, and after fine-tuning.
During pre-training, we closely monitor how the model learns, as shown at the end of Chapter 5. The most straightforward metrics are low-level and correspond to how models are trained:
- Training loss: Based on the cross-entropy loss, measures the difference between the model’s predicted probability distribution and the true distribution of the next token
- Validation loss: Calculates the same loss as training loss, but on a held-out validation set to assess generalization
- Perplexity: Exponential of the cross-entropy loss, representing how “surprised” the model is by the data (lower is better)
- Gradient norm: Monitors the magnitude of gradients during training to detect potential instabilities or vanishing/exploding gradients
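To make the relationship between these quantities concrete, here is a minimal PyTorch sketch that computes the loss, perplexity, and gradient norm for a single batch. It assumes a Hugging Face-style causal language model that returns logits; the function name and setup are illustrative, not taken from this chapter.

```python
import torch
import torch.nn.functional as F

def training_step_metrics(model, input_ids):
    """Compute cross-entropy loss, perplexity, and gradient norm for one batch."""
    model.zero_grad()

    # Shift inputs so that each token is used to predict the next one
    logits = model(input_ids).logits[:, :-1, :]
    targets = input_ids[:, 1:]

    # Cross-entropy between the predicted distribution and the true next token
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )

    # Perplexity is simply the exponential of the cross-entropy loss
    perplexity = torch.exp(loss)

    # Backpropagate, then measure the global L2 norm of the gradients
    # (max_norm=inf means we only measure, without clipping)
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=float("inf")
    )

    return loss.item(), perplexity.item(), grad_norm.item()
```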
It’s also possible to include benchmarks like HellaSwag (common sense reasoning) during this stage, but there’s a risk of overfitting to these evaluations.
After pre-training, it is common to use a suite of benchmarks to evaluate the base model. This suite can include internal and public benchmarks. Here’s a non-exhaustive list of common public pre-training evaluations:
- MMLU (knowledge): Tests models on multiple-choice questions across 57 subjects, from elementary to professional levels
- HellaSwag (reasoning): Challenges models to complete a given situation with the most plausible ending from multiple choices
- ARC-C (reasoning): Evaluates models on grade-school-level multiple-choice science questions requiring causal reasoning
- Winogrande (reasoning): Assesses common sense reasoning through pronoun resolution in carefully crafted sentences
- PIQA (reasoning): Measures physical common sense understanding through questions about everyday physical interactions
Many of these datasets are also used to evaluate general-purpose fine-tuned models. In this case, we focus on the difference in a given score between the base and the fine-tuned model. For example, bad fine-tuning can degrade the knowledge of the model, measured by MMLU. Conversely, good fine-tuning might instill even more knowledge and increase the MMLU score.
This can also help identify any contamination issues, where the model might have been fine-tuned on data that is too close to a test set. For instance, improving the MMLU score of a base model by 10 points during the fine-tuning phase is unlikely. This is a sign that the instruction data might be contaminated.
In addition to these pre-trained evaluations, fine-tuned models also have their own benchmarks. Here, we use the term “fine-tuned model” to designate a model that has been trained with supervised fine-tuning (SFT) and preference alignment. These benchmarks target capabilities connected to the ability of fine-tuned models to understand and answer questions. In particular, they test instruction-following, multi-turn conversation, and agentic skills:
- IFEval (instruction following): Assesses a model’s ability to follow instructions with particular constraints, such as not outputting any commas in the answer
- Chatbot Arena (conversation): A framework where humans vote for the best answer to an instruction, comparing two models in head-to-head conversations
- AlpacaEval (instruction following): Automatic evaluation for fine-tuned models that is highly correlated with Chatbot Arena
- MT-Bench (conversation): Evaluates models on multi-turn conversations, testing their ability to maintain context and provide coherent responses
- GAIA (agentic): Tests a wide range of abilities like tool use and web browsing, in a multi-step fashion
Understanding how these evaluations are designed and used is important to choose the best LLM for your application. For example, if you want to fine-tune a model, you want the best base model in terms of knowledge and reasoning for a given size. This allows you to compare the capabilities of different LLMs and pick the one that will offer the strongest foundation for your fine-tuning.
Even if you don’t want to fine-tune a model, benchmarks like Chatbot Arena or IFEval are a good way to compare different instruct models. For instance, you want great conversational abilities if you’re building a chatbot. However, this is not necessary if your end goal is something like information extraction from unstructured documents. In this case, you will benefit more from excellent instruction-following skills to understand and execute tasks.
While these benchmarks are popular and useful, they also suffer from inherent flaws. For example, public benchmarks can be gamed by training models on test data or samples that are very similar to benchmark datasets. Even human evaluation is not perfect and is often biased toward long and confident answers, especially when they’re nicely formatted (e.g., using Markdown). On the other hand, private test sets have not been scrutinized as much as public ones and might have their own issues and biases.
This means that benchmarks are not a single source of truth but should be used as signals. Once multiple evaluations provide a similar answer, you can raise your confidence level about the real capabilities of a model.
Domain-specific LLM evaluations
Domain-specific LLMs don’t have the same scope as general-purpose models. Their narrower scope makes it possible to target more fine-grained capabilities with more depth than the previous benchmarks.
Within this category, the choice of benchmarks entirely depends on the domain in question. For common applications like a language-specific model or a code model, it is recommended to search for relevant evaluations and even benchmark suites. These suites encompass different benchmarks and are designed to be reproducible. By targeting different aspects of a domain, they often capture domain performance more accurately.
To illustrate this, here is a list of domain-specific evaluations with leaderboards on the Hugging Face Hub:
- Open Medical-LLM Leaderboard: Evaluates the performance of LLMs in medical question-answering tasks. It combines 9 metrics, with 1,273 questions from the US medical license exams (MedQA), 500 questions from PubMed articles (PubMedQA), 4,183 questions from Indian medical entrance exams (MedMCQA), and 1,089 questions from 6 sub-categories of MMLU (clinical knowledge, medical genetics, anatomy, professional medicine, college biology, and college medicine).
- BigCodeBench Leaderboard: Evaluates the performance of code LLMs, featuring two main categories: BigCodeBench-Complete for code completion based on structured docstrings, and BigCodeBench-Instruct for code generation from natural language instructions. Models are ranked by their Pass@1 scores using greedy decoding, with an additional Elo rating for the Complete variant. It covers a wide range of programming scenarios that test LLMs’ compositional reasoning and instruction-following capabilities.
- Hallucinations Leaderboard: Evaluates LLMs’ tendency to produce false or unsupported information across 16 diverse tasks spanning 5 categories. These include Question Answering (with datasets like NQ Open, TruthfulQA, and SQuADv2), Reading Comprehension (using TriviaQA and RACE), Summarization (employing HaluEval Summ, XSum, and CNN/DM), Dialogue (featuring HaluEval Dial and FaithDial), and Fact Checking (utilizing MemoTrap, SelfCheckGPT, FEVER, and TrueFalse). The leaderboard also assesses instruction-following ability using IFEval.
- Enterprise Scenarios Leaderboard: Evaluates the performance of LLMs on six real-world enterprise use cases, covering diverse tasks relevant to business applications. The benchmarks include FinanceBench (100 financial questions with retrieved context), Legal Confidentiality (100 prompts from LegalBench for legal reasoning), Writing Prompts (creative writing evaluation), Customer Support Dialogue (relevance in customer service interactions), Toxic Prompts (safety assessment for harmful content generation), and Enterprise PII (business safety for sensitive information protection). Some test sets are closed-source to prevent gaming of the leaderboard. The evaluation focuses on specific capabilities such as answer accuracy, legal reasoning, creative writing, contextual relevance, and safety measures, providing a comprehensive assessment of LLMs’ suitability for enterprise environments.
Leaderboards can have different approaches based on their domain. For example, BigCodeBench is significantly different from others because it relies on only two metrics that sufficiently capture the entire domain. On the other hand, the Hallucinations Leaderboard combines 16 metrics, including many general-purpose evaluations. It shows that, in addition to custom benchmarks, reusing general-purpose ones can complement your own suite.
In particular, language-specific LLMs often reuse translated versions of general-purpose benchmarks. These can be complemented with original evaluations in the native language. While some of these benchmarks use machine translation, it is better to rely on human-translated evaluations to improve their quality. We selected the following three language-specific leaderboards and their respective evaluation suites to give you an idea of how to build your own:
- Open Ko-LLM Leaderboard: Evaluates the performance of Korean LLMs using nine metrics. These metrics are a combination of general-purpose benchmarks translated into Korean (GPQA, Winogrande, GSM8K, EQ-Bench, and IFEval) and custom evaluations (Knowledge, Social Value, Harmlessness, and Helpfulness).
- Open Portuguese LLM Leaderboard: Evaluates the performance of Portuguese language LLMs using nine diverse benchmarks. These benchmarks include educational assessments (ENEM with 1,430 questions, and BLUEX with 724 questions from university entrance exams), professional exams (OAB Exams with over 2,000 questions), language understanding tasks (ASSIN2 RTE and STS, FAQUAD NLI), and social media content analysis (HateBR with 7,000 Instagram comments, PT Hate Speech with 5,668 tweets, and tweetSentBR).
- Open Arabic LLM Leaderboard: Evaluates the performance of Arabic language LLMs using a comprehensive set of benchmarks, including both native Arabic tasks and translated datasets. The leaderboard features two native Arabic benchmarks: AlGhafa and Arabic-Culture-Value-Alignment. Additionally, it incorporates 12 translated benchmarks covering various domains, such as MMLU, ARC-Challenge, HellaSwag, and PIQA.
Both general-purpose and domain-specific evaluations are designed around three main principles. First, they should be complex enough to challenge models and distinguish good outputs from bad ones. Second, they should be diverse and cover as many topics and scenarios as possible. When one benchmark is not enough, additional ones can create a stronger suite. Finally, they should be practical and easy to run. This last point depends largely on evaluation libraries, which can be more or less complex to work with. We recommend lm-evaluation-harness (github.com/EleutherAI/lm-evaluation-harness) from Eleuther AI and lighteval (github.com/huggingface/lighteval) from Hugging Face to run your benchmarks.
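For reference, here is a minimal sketch of how such a suite might be run with lm-evaluation-harness’s Python entry point. The model name, task list, and exact argument names are assumptions that may vary across library versions, so treat this as a starting point rather than a definitive recipe.

```python
import lm_eval

# Evaluate a Hugging Face model on a small suite of general-purpose benchmarks.
# The model name and task list are examples; adjust them to your own setup.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B",
    tasks=["mmlu", "hellaswag", "arc_challenge", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)

# Aggregated scores per task are stored under the "results" key
for task, metrics in results["results"].items():
    print(task, metrics)
```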
Task-specific LLM evaluations
While general-purpose and domain-specific evaluations indicate strong base or instruct models, they cannot provide insights into how well these models work for a given task. This requires benchmarks specifically designed for this purpose, measuring downstream performance.
Because of their narrow focus, task-specific LLMs can rarely rely on pre-existing evaluation datasets. However, this narrow focus can also be an advantage: their outputs tend to be more structured and easier to evaluate using traditional ML metrics. For example, a summarization task can leverage the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric, which measures the overlap between the generated text and reference text using n-grams.
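As an illustration, ROUGE can be computed with the Hugging Face evaluate library; the prediction and reference below are placeholder strings.

```python
import evaluate

# Load the ROUGE metric (n-gram overlap between generated and reference text)
rouge = evaluate.load("rouge")

predictions = ["The model summarizes the report in two sentences."]
references = ["The report is summarized by the model in two sentences."]

# Returns ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum scores
scores = rouge.compute(predictions=predictions, references=references)
print(scores)
```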
Likewise, classification tasks benefit from traditional ML metrics and use the following classic ones, among others:
- Accuracy: The proportion of correctly predicted instances out of the total number of instances. It’s particularly useful for tasks with categorical outputs or where there is a clear distinction between right and wrong answers, such as named entity recognition (NER).
- Precision: The ratio of true positive predictions to the total positive predictions made by the model.
- Recall: The ratio of true positive predictions to the total actual positive instances.
- F1 Score: The harmonic mean of precision and recall, used to balance both metrics. These are particularly useful in tasks such as classification or entity extraction.
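These metrics can be computed directly with scikit-learn once the model’s outputs have been mapped to class labels, as in the following sketch with purely illustrative labels.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Ground-truth and predicted labels for a binary classification task (illustrative)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, F1: {f1:.2f}")
```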
When the task cannot be directly mapped to a traditional ML task, it is possible to create a custom benchmark. This benchmark can be inspired by general-purpose and domain-specific evaluation datasets. A common and successful pattern is the use of multiple-choice question answering. In this framework, the instruction consists of a question with several options. See the following example with a question from the MMLU dataset (abstract algebra):
Instruction:
Find the degree for the given field extension Q(sqrt(2), sqrt(3)) over Q.
A. 0
B. 4
C. 2
D. 6

Output:
B
Table 7.1: Example from the MMLU dataset
There are two main ways of evaluating models with this scheme—text generation and log-likelihood evaluations:
- The first approach involves having the model generate text responses and comparing those to predefined answer choices. For example, the model generates a letter (A, B, C, or D) as its answer, which is then checked against the correct answer. This method tests the model’s ability to produce coherent and accurate responses in a format similar to how it would be used in real-world applications.
- Evaluation using probabilities, on the other hand, looks at the model’s predicted probabilities for different answer options without requiring text generation. For MMLU, lm-evaluation-harness compares the probabilities for the full text of each answer choice. This approach allows for a more nuanced assessment of the model’s understanding, as it can capture the relative confidence the model has in different options, even if it wouldn’t necessarily generate the exact correct answer text.
For simplicity, we recommend the text-generation version of the evaluation that mimics human test-taking. It is easier to implement, and generally more discriminative, as low-quality models tend to overperform on probability-based evaluations. You can adapt this technique to quiz your models about a particular task, and even expand it to specific domains.
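To make the probability-based variant more concrete, here is a minimal sketch using transformers that scores the full text of each answer choice by its summed log-likelihood and picks the most likely one. The model name is only an example, and the tokenization details are simplified.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # example model, replace with your own
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def choice_log_likelihood(question: str, choice: str) -> float:
    """Sum of log-probabilities assigned to the tokens of a candidate answer."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probabilities of each token given the previous ones
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Only keep the positions corresponding to the answer choice
    answer_start = prompt_ids.shape[1] - 1
    return token_log_probs[:, answer_start:].sum().item()

question = "Find the degree for the given field extension Q(sqrt(2), sqrt(3)) over Q."
choices = ["0", "4", "2", "6"]
scores = [choice_log_likelihood(question, c) for c in choices]
prediction = choices[scores.index(max(scores))]  # expected answer: "4"
```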
Conversely, if the task is too open-ended, traditional ML metrics and multiple-choice question answering might not be relevant. In this scenario, the LLM-as-a-judge technique introduced in Chapter 5 can be used to evaluate the quality of the answers. If you have ground-truth answers, providing them as additional context improves the accuracy of the evaluation. Otherwise, defining different dimensions (such as relevance or toxicity, depending on your task) can also ground the evaluation in more interpretable categories.
It is recommended to use large models for evaluation and to iteratively refine your prompt. In this process, the explanations outputted by the model are important for understanding errors in its reasoning and fixing them through additional prompt engineering.
In order to easily parse answers, one can specify a structure in the instruction or use some kind of structured generation (like Outlines or OpenAI’s JSON mode). Here is an example of an instruction with a structure:
You are an evaluator who assesses the quality of an answer to an instruction. Your goal is to provide a score that represents how well the answer addresses the instruction. You will use a scale of 1 to 4, where each number represents the following:
1. The answer is not relevant to the instruction.
2. The answer is relevant but not helpful.
3. The answer is relevant and helpful but could be more detailed.
4. The answer is relevant, helpful, and detailed.

Please provide your evaluation as follows:

##Evaluation##
Explanation: (analyze the relevance, helpfulness, and complexity of the answer)
Total rating: (final score as a number between 1 and 4)

Instruction: {instruction}
Answer: {answer}

##Evaluation##
Explanation:
Table 7.2: Example of general-purpose LLM-as-a-judge prompt for answer evaluation
Naturally, you can tweak the scale, add a ground-truth answer to this prompt, and customize it for your own use cases.
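As an illustration of how such a judge could be wired up, the following sketch sends a shortened version of the prompt above to a judge model through the OpenAI Python client and extracts the final rating with a regular expression. The client, model name, and parsing logic are assumptions rather than a prescribed implementation.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Shortened version of the judge prompt from Table 7.2
JUDGE_PROMPT = (
    "You are an evaluator who assesses the quality of an answer to an instruction. "
    "Use a scale of 1 to 4 and reply with:\n"
    "##Evaluation##\n"
    "Explanation: (your analysis)\n"
    "Total rating: (a number between 1 and 4)\n\n"
    "Instruction: {instruction}\n"
    "Answer: {answer}\n"
    "##Evaluation##\n"
    "Explanation:"
)

def judge(instruction: str, answer: str, model: str = "gpt-4o"):
    """Ask the judge model for an evaluation and extract the numeric rating."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(instruction=instruction, answer=answer),
        }],
        temperature=0,
    )
    text = response.choices[0].message.content
    match = re.search(r"Total rating:\s*([1-4])", text)
    return text, int(match.group(1)) if match else None
```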
However, judge LLMs can exhibit biases favoring assertive or verbose responses, potentially overrating answers that sound more confident but are less accurate. They may also lack domain expertise for specialized topics, leading to misjudgments. Consistency is also a concern, as LLMs might score similar responses differently. Additionally, they could have implicit preferences for certain writing styles unrelated to actual answer quality. To mitigate these issues, it’s possible to combine LLM evaluations with other metrics, use multiple judges, and carefully design prompts to address biases.
Once a model has been properly evaluated and works as intended, it might be included within a broader system. In the next section, we will see how systems change the evaluation framework.