Detecting and Mitigating Hallucinations in LLMs

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!

Introduction

In large language models, the term "hallucination" describes a behavior where an AI model produces results that are not entirely accurate or might sound nonsensical. It's important to understand that large language models are not search engines or databases. They do not search for information from external sources or perform complex computations. Instead, large language models (LLM) belong to a category of generative artificial intelligence.

Recap How Generative AI Works

Generative AI is a technology trained on large volumes of data and, as a result, can " generate" text, images, and even audio. This makes it fundamentally different from search engines and other software tools you might be familiar with. This foundational difference presents challenges, most notably that generative AI can’t cite sources for its responses. Large language models are also not designed to solve computational problems like math. However, generative AI can quickly generate code that might solve complex mathematical challenges. A large language model responds to inputs, most notably the text instruction called a "prompt." As the large language model generates text, it uses its training data as a foundation to extrapolate information.

Understanding Hallucinations

The simplest way to understand a hallucination is the old game of telephone. In the same way, a message gets distorted in the game of telephone; information can get "distorted" or "hallucinated" as a language model tries to generate outputs based on patterns it observed in its training data. The model might "misremember" or "misinterpret" certain information, leading to inaccuracies.

Let's use another model example to understand the concept of generating unique combinations of words in the context of food recipes. Imagine you want to create new recipes by observing existing ones. If you were to build a Markov model for food ingredients, you would:

1. Compile a comprehensive dataset of recipes and extract individual ingredients.

2. Create pairs of neighboring ingredients, like "tomato-basil" and "chicken-rice," and record how often each pair occurs.

For example, if you start with the ingredient "chicken," you might notice it's frequently paired with "broccoli" and "garlic" but less so with "pineapple." If you then choose "broccoli" as the next ingredient, it might be equally likely to be paired with "cheese" or "lemon." By following these ingredient pairings, at some point, the model might suggest creative combinations like "chicken-pineapple-lemon," offering new culinary ideas based on observed patterns.

This approach allows the Markov model to generate novel recipe ideas based on the statistical likelihood of ingredient pairings.

Hallucinations as a Feature

When researching or computing factual information, a hallucination is a bad thing. However, the same concept that gets a bad rap for accurate information or research is what makes large language models demonstrate another human condition of creativity. As a developer, if you want to make your language model creative, OpenAI, for example, has a "temperature" input, a hyperparameter that makes the model's outputs more random. A high temperature of 1 or above will result in hallucinations and randomness. For example, a lower temperature of .2 will make the modern outputs more deterministic to match patterns it was trained on.

detecting-and-mitigating-hallucinations-in-llms-img-0

As an experiment, try inputting a prompt to any large language model chatbot, including ChatGPT, to provide a plot of any romantic story without copying existing accounts on the internet, a new storyline, and new characters. The LLM will offer a fictitious story with characters, a plot, multiple acts, an arc, and an ending.

In specific scenarios, end users or developers might intentionally coax their large language models into a state of "hallucination. When seeking out-of-the-box ideas to think beyond its training, you can get abstract ideas. In this scenario, the model's ability to "hallucinate" isn't a bug but rather a feature. To continue the experiment, you can return to ChatGPT and ask it to pretend you have changed the temperature hyperparameter to 1.1 and re-write the story. Your results will be very “creative.”

In creative pursuits, like crafting tales or penning poems, these so-called "hallucinations" aren't just tolerated; they're celebrated. They can add depth, surprise, and innovation layers to the generated content.

Types of hallucination

One can categorize hallucinations into different forms.

Intrinsic hallucination happens to contradict the source material directly. It also offers logical inconsistency and factual inaccuracies.
Extrinsic hallucination does not contradict. However, at the same time, it cannot be verified against any source. Hence, it adds elements that are considered to be unconfirmable and speculative.

Detecting hallucinations

Detecting hallucinations in the large language models is a tricky task. LLMs will deliver information with the same tone and certainty even if the answer is unknown. It puts the responsibility of users and developers to be careful about how information from LLMs is used.

The following techniques can be utilized to uncover or measure hallucinations in large language models.

Identify the grounding data

Grounding data is the standard against which the Large Language Model (LLM) output is measured. The selection of grounding data depends on the specific application. For example, real job resumes could be grounding data for generating resume-related content. Conversely, search engine outcomes could be employed for web-based inquiries. Especially in language translation, the choice of grounding data is pivotal for accurate translation.

For example, official legal documents could serve as grounding data for legal translations, ensuring precision in the translated content.

Create a measurement test set

A measurement test data set comprises input/output pairs incorporating human interactions and the Large Language Model (LLM). These datasets often include various input conditions and their corresponding program outputs. These sets may involve simulated interactions between users and software systems, depending on the scenario.

Ideally, there should be a minimum of two kinds of test sets:

1. A standard or randomly generated test set that would be conventional but cater to diverse scenarios.

2. An adversarial test set is created through performance in edge cases, high-risk situations, or when presented with deliberately misleading or tricky inputs, even security threats.

Extract any claims

Following the preparation of test data sets, the subsequent stage involves extracting assertions from the Large Language Model (LLM). This extraction can occur manually, through rule-based methodologies, or even by employing machine learning models.

In data analysis, the next step is to extract specific patterns from the data after gathering the datasets. This extraction can be done manually or through predefined rules, basic descriptive analytics, or, for large-scale projects, machine learning algorithms. Each method has its merits and drawbacks, which we will thoroughly investigate.

Use validations against any grounding data

Validation guarantees that the content generated by the Large Language Model (LLM) corresponds to the grounding data. Frequently, this stage replicates the techniques employed for data extraction.

To support the above, here is the code snippet of the same.

# Define grounding data (acceptable sentences)
grounding_data = [
    "The sky is blue.",
    "Python is a popular programming language.",
    "ChatGPT provides intelligent responses."
]
 
# List of generated sentences to be validated
generated_sentences = [
    "The sky is blue.",
    "ChatGPT is a popular programming language.",
    "Python provides intelligent responses."
]
 
# Validate generated sentences against grounding data
valid_sentences = [sentence for sentence in generated_sentences if sentence in grounding_data]
 
# Output valid sentences
print("Valid Sentences:")
for sentence in valid_sentences:
    print("- " + sentence)
 
# Output invalid sentences
invalid_sentences = list(set(generated_sentences) - set(valid_sentences))
print("\nInvalid Sentences:")
for sentence in invalid_sentences:
    print("- " + sentence)

Output:

 Valid Sentences:
- The sky is blue.
 
Invalid Sentences:
- ChatGPT is a popular programming language.
- Python provides intelligent responses.

Furthermore, in verifying research findings, validation ensures that the conclusions drawn from the research align with the collected data. This process often mirrors the research methods employed earlier.

Metrics reporting

The "Grounding Defect Rate" is a crucial metric that measures the proportion of responses lacking context to the total generated outputs. Further metrics will be explored later for a more detailed assessment.

For instance, the "Error Rate" is a vital metric indicating the percentage of mistranslated phrases from the translated text. Additional metrics will be introduced later for a comprehensive evaluation.

A Multifaceted Approach to Mitigate Hallucination in the Large Language Model

detecting-and-mitigating-hallucinations-in-llms-img-2

Leveraging product design

The developer needs to employ large language models so that it does not create material issues, even when it hallucinates. For example, you would not design an application that writes your annual report or news articles. Instead, opinion pieces or content summarization within a prompt can immediately lower the risk of problematic hallucination.

If an app allows AI-generated outputs to be distributed, end users should be able to review and revise the content. It adds a protective layer of scrutiny and puts the responsibility into the hands of the user.

Continuous improvement and logging

Persisting prompts and LLM output are essential for auditing purposes. As models evolve, you cannot count on prompting an LLM and getting the same result. However, regression testing and reviewing user input are critical as long as it adheres to data, security, and privacy practice.

Prompt engineering

It is essential to get the best possible output to use the concept of meta prompts effectively. A meta prompt is a high-level instruction given to a language model to guide its output in a specific direction. Rather than asking a direct question, provide context, structure, and guidance to refine the output.

For example, instead of asking, "What is photosynthesis?", you can ask, "Explain photosynthesis in simple terms suitable for a 5th-grade student." This will adjust the complexity and style of the answer you get.

Multi-Shot Prompts

Multi-shot prompts refer to a series of prompts given to a language model, often in succession. The goal is to guide the model step-by-step toward a desired output instead of asking for a large chunk of information in a single prompt. This approach is extremely useful when the required information is complex or extensive. Typically, these prompts are best delivered as a chat user experience, allowing the user and model to break down the requests into multiple, manageable parts.

Conclusion

The issue of hallucination in Large Language Models (LLMs) presents a significant hurdle for consumers, users, and developers. While overhauling the foundational architecture of these models isn't a feasible solution for most, the good news is that there are strategies to navigate these challenges. But beyond these technical solutions, there's an ethical dimension to consider. As developers and innovators harness the power of LLMs, it's imperative to prioritize disclosure and transparency. Only through openness can we ensure that LLMs integrate seamlessly into our daily lives and gain the trust and acceptance they require to revolutionize our digital interactions truly.

Author Bio

Ryan Goodman has dedicated 20 years to the business of data and analytics, working as a practitioner, executive, and entrepreneur. He recently founded DataTools Pro after 4 years at Reliant Funding, where he served as the VP of Analytics and BI. There, he implemented a modern data stack, utilized data sciences, integrated cloud analytics, and established a governance structure. Drawing from his experiences as a customer, Ryan is now collaborating with his team to develop rapid deployment industry solutions. These solutions utilize machine learning, LLMs, and modern data platforms to significantly reduce the time to value for data and analytics teams.