Testing Large Language Models (LLMs)

Machine learning has become ubiquitous, with models powering everything from search engines and recommendation systems to chatbots and autonomous vehicles. As these models grow more complex, testing them thoroughly is crucial to ensure they behave as expected. This is especially true for large language models like GPT-4 that generate human-like text and engage in natural conversations.

In this article, we will explore strategies for testing machine learning models, with a focus on evaluating the performance of LLMs.

Introduction

Machine learning models are notoriously challenging to test due to their black-box nature. Unlike traditional code, we cannot simply verify the logic line-by-line. ML models learn from data and make probabilistic predictions, so their decision-making process is opaque.

While testing methods like unit testing and integration testing are common for traditional software, they do not directly apply to ML models. We need more specialized techniques to validate model performance and uncover unexpected or undesirable behavior.

Testing is particularly crucial for large language models. Since LLMs can generate free-form text, it's hard to anticipate their exact responses. Flaws in the training data or model architecture can lead to hallucinations, biases, and errors that only surface during real-world usage. Rigorous testing provides confidence that the model works as intended.

In this article, we will cover testing strategies to evaluate LLMs. The key techniques we will explore are:

  • Similarity testing
  • Column coverage testing
  • Exact match testing
  • Visual output testing
  • LLM-based evaluation

By combining these methods, we can thoroughly test LLMs along multiple dimensions and ensure they provide coherent, accurate, and appropriate responses.

Testing Text Output with Similarity Search

A common output from LLMs is text. This could be anything from chatbot responses to summaries generated from documents. A robust way to test the quality of text output is similarity testing.

The idea is simple - we define an expected response and compare the model's actual response to determine how similar they are. The higher the similarity score, the better.

Let's walk through an example using our favorite LLM. Suppose we give it the prompt:

Prompt: What is the capital of Italy?

The expected response would be:

Expected: The capital of Italy is Rome.

Now we can pass this prompt to the LLM and get the actual response:

prompt = "What is the capital of Italy?"
actual = llm.ask(prompt)

Let's say actual contains:

Actual: Rome is the capital of Italy.

While the wording is different, the meaning is the same. To quantify this similarity, we can use a semantic search library like SentenceTransformers, which represents sentences as numeric vectors and measures how close they are in meaning using cosine similarity.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode both responses into dense vectors
expected_embedding = model.encode(expected)
actual_embedding = model.encode(actual)

# Cosine similarity between the two embeddings (1.0 means identical meaning)
similarity = cosine_similarity([expected_embedding], [actual_embedding])[0][0]

This yields a similarity score of 0.85, indicating the responses are highly similar in meaning.

We can establish a threshold for the minimum acceptable similarity, like 0.8. Responses below this threshold fail the test. By running similarity testing over many prompt-response pairs, we can holistically assess the textual coherence of an LLM.
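To make this concrete, the check can be wrapped in a small test loop over a suite of prompt-response pairs. Here is a minimal sketch that reuses the SentenceTransformer model above and assumes the same placeholder llm.ask interface used earlier:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
THRESHOLD = 0.8

# Each test case pairs a prompt with the response we expect
test_cases = [
    ("What is the capital of Italy?", "The capital of Italy is Rome."),
    ("What is the capital of France?", "The capital of France is Paris."),
]

for prompt, expected in test_cases:
    actual = llm.ask(prompt)  # placeholder LLM interface, as above
    score = cosine_similarity([model.encode(expected)], [model.encode(actual)])[0][0]
    print(f"{'PASS' if score >= THRESHOLD else 'FAIL'} ({score:.2f}): {prompt}")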

Testing Tabular Outputs with Column Coverage

In addition to text, LLMs can output tables or data frames. For testing these, we need different techniques that account for structure.

A good validation is column coverage - checking what percentage of columns in the expected output are present in the actual output.

Consider the LLM answering questions about movies:

Prompt: What are the top 3 highest grossing movies of all time?

Expected:

Movie | Worldwide Gross | Release Year
Avatar | $2,789,679,794 | 2009
Titanic | $2,187,463,944 | 1997
Star Wars Ep. VII | $2,068,223,624 | 2015

Now we can test the LLM’s actual output:

prompt = "What are the top 3 highest grossing movies of all time?"
actual = llm.ask(prompt)

Actual:

Movie | Global Revenue | Year
Avatar | $2.789 billion | 2009
Titanic | $2.187 billion | 1997
Star Wars: The Force Awakens | $2.068 billion | 2015

Here, actual contains the same three columns as expected: the movie name, its gross revenue, and its release year. Even though the headers and cell values differ slightly, we can pair the columns semantically (for example, by the cosine similarity of their name embeddings) and reach 100% column coverage.

We can formalize this in code:

expected_cols = set(expected.columns)
actual_cols = set(actual.columns)

# Assumes differing headers (e.g. "Global Revenue" vs. "Worldwide Gross")
# have already been mapped back to their expected names
column_coverage = len(expected_cols & actual_cols) / len(expected_cols)

# column_coverage = 1.0

For tables with many columns, we might only require, say, 90% coverage to pass the test. This validation ensures the critical output columns are present while allowing some variability in column names or ancillary data.
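If exact header matches are too strict, one option is to pair column names by embedding them, reusing the same SentenceTransformer model as before. A minimal sketch, where the 0.6 threshold is an arbitrary assumption to tune per use case:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_column_coverage(expected_cols, actual_cols, threshold=0.6):
    # Embed every column name from both tables
    expected_emb = model.encode(list(expected_cols))
    actual_emb = model.encode(list(actual_cols))
    # Pairwise similarity matrix: rows = expected columns, columns = actual columns
    similarities = cosine_similarity(expected_emb, actual_emb)
    # An expected column is covered if some actual column name is similar enough
    covered = sum(1 for row in similarities if row.max() >= threshold)
    return covered / len(expected_cols)

coverage = semantic_column_coverage(
    ["Movie", "Worldwide Gross", "Release Year"],
    ["Movie", "Global Revenue", "Year"],
)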

Exact Match for Numeric Outputs

When LLMs output a single number or statistic, we can use simple exact match testing.

Consider this prompt:

Prompt: What was Apple's total revenue in 2021?

Expected: $365.82 billion

We get the LLM’s response:

prompt = "What was Apple's total revenue in 2021?"
actual = llm.ask(prompt)

Actual: $365.82 billion

In this case, we expect an exact string match:

is_match = (actual == expected)

# is_match = True

For numerical outputs, precision is important. Exact match testing provides a straightforward way to validate this.
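One caveat: raw string equality is sensitive to incidental formatting. A simple refinement is to normalize both strings before comparing; a minimal sketch:

# Normalize whitespace and casing before the exact comparison
def normalize(value: str) -> str:
    return value.strip().lower()

is_match = normalize(actual) == normalize(expected)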

Screenshot Testing for Visual Outputs

When building PandasAI, we sometimes need to test generated charts. Testing these outputs requires verifying that the visualized data is correct.

One method is screenshot testing - comparing screenshots of the expected and actual visuals. For example:

Prompt: Generate a bar chart comparing the revenue of FAANG companies.

Expected: [Expected_Chart.png]

Actual: [Actual_Chart.png]

We can then test if the images match:

from PIL import Image, ImageChops

expected_img = Image.open("./Expected_Chart.png")
actual_img = Image.open("./Actual_Chart.png")

# Pixel-wise difference (images must share size and mode);
# getbbox() returns None when every pixel matches
diff = ImageChops.difference(expected_img, actual_img)
is_match = diff.getbbox() is None

# is_match = True if the images match exactly

For more robust validation, we could use computer vision techniques like template matching to identify and compare key elements: axes, bars, labels, etc.
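As an illustration, OpenCV's matchTemplate can check whether a known element (here, a hypothetical legend crop saved as Expected_Legend.png) appears somewhere in the generated chart; the 0.8 threshold is an assumption to tune:

import cv2

# Load the chart and the element we expect to find inside it, in grayscale
chart = cv2.imread("./Actual_Chart.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("./Expected_Legend.png", cv2.IMREAD_GRAYSCALE)

# Slide the template over the chart and record normalized correlation scores
result = cv2.matchTemplate(chart, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

# Treat the element as present if the best match exceeds the chosen threshold
element_present = max_val >= 0.8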

Screenshot testing provides quick validation of visual output without needing to interpret the raw chart data.

LLM-Based Evaluation

An intriguing idea for testing LLMs is to use another LLM!

The concept is to pass the expected and actual outputs to a separate "evaluator" LLM and ask if they match.

For example:

Expected: Rome is the capital of Italy.

Actual: The capital of Italy is Rome.

We can feed this to the evaluator model:

Prompt: Do these two sentences convey the same information? Answer YES or NO

Sentence 1: Rome is the capital of Italy.

Sentence 2: The capital of Italy is Rome.

Evaluator: YES

The evaluator LLM acts like a semantic similarity scorer. This takes advantage of the natural language capabilities of LLMs.
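In code, this might look like the following sketch, reusing the same placeholder ask interface for a separate evaluator model (here called evaluator_llm):

EVALUATION_PROMPT = """Do these two sentences convey the same information? Answer YES or NO

Sentence 1: {expected}

Sentence 2: {actual}"""

def llm_evaluate(expected: str, actual: str) -> bool:
    # Ask the evaluator model whether the two responses match in meaning
    verdict = evaluator_llm.ask(EVALUATION_PROMPT.format(expected=expected, actual=actual))
    return verdict.strip().upper().startswith("YES")

is_match = llm_evaluate(
    "Rome is the capital of Italy.",
    "The capital of Italy is Rome.",
)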

The downside is it evaluates one black box model using another black box model. Errors or biases in the evaluator could lead to incorrect assessments. So LLM-based evaluation should complement other testing approaches, not act as the sole method.

Conclusion

Testing machine learning models thoroughly is critical as they grow more ubiquitous and impactful. Large language models pose unique testing challenges due to their free-form textual outputs.

Using a combination of similarity testing, column coverage validation, exact match, visual output screening, and even LLM-based evaluation, we can rigorously assess LLMs along multiple dimensions.

A comprehensive test suite combining these techniques will catch more flaws than any single method alone. This builds essential confidence that LLMs behave as expected in the real world.

Testing takes time but prevents much larger problems down the road. The strategies covered in this article will add rigor to the development and deployment of LLMs, helping ensure these powerful models benefit humanity as intended.

Author Bio

Gabriele Venturi is a software engineer and entrepreneur who started coding at the young age of 12. Since then, he has launched several projects across gaming, travel, finance, and other spaces - contributing his technical skills to various startups across Europe over the past decade.

Gabriele's true passion lies in leveraging AI advancements to simplify data analysis. This mission led him to create PandasAI, released open source in April 2023. PandasAI integrates large language models into the popular Python data analysis library Pandas. This enables an intuitive conversational interface for exploring data through natural language queries.

By open-sourcing PandasAI, Gabriele aims to share the power of AI with the community and push boundaries in conversational data analytics. He actively contributes as an open-source developer dedicated to advancing what's possible with generative AI.