




















































Large Language Models (LLMs) such as GPT-3.5, GPT-4, or Claude have shown strong general capabilities across many tasks, from question answering and coding assistance to marketing campaigns and more. However, utilizing these general LLMs in production, especially for enterprises, is not an easy task.
One solution to these problems is to fine-tune a smaller LLM that is specific to the task we want to handle. For example, suppose we need a QnA model that answers user queries based only on a provided passage. Instead of utilizing those general LLMs, we can fine-tune a smaller LLM, say one with 7 billion parameters, to do this specific task. Why utilize such a giant LLM when our use case is only QnA?
The quality of training data plays a pivotal role in the success of fine-tuning. Garbage in, garbage out holds true in the world of LLMs. When you fine-tune on low-quality data, you risk transferring noise, biases, and inaccuracies to your model. Take the recently released paper, Textbooks Are All You Need II: phi-1.5 Technical Report, as an example. Despite its relatively low number of parameters (1.5B), this model performs as well as models five times its size. Additionally, it excels in complex reasoning tasks, surpassing most non-frontier LLMs. What's their secret sauce? High-quality training data!
The next question is how to prepare the training data for LLM fine-tuning - and, more importantly, how to prepare high-quality training data. Since fine-tuning needs labeled training data, we need to annotate the unlabeled data that we have. Annotating unlabeled data for classification tasks is much easier than for more complex tasks like summarization: we just need to assign labels based on the available classes of the classification task.
If you have previously deployed an application with those general LLMs and have data coming from real production traffic, then you can use that data as training data. In fact, you can use the responses coming from the general LLM directly as labels, with no further annotation needed. However, what if you don't have real production data? Then you can use open-source data, or even synthetic data generated by a general LLM, as your unlabeled data.
Throughout this article, we'll discuss ways to give high-quality labels to unlabeled training data, whether it's annotated by humans or by a general LLM, along with the pros and cons of each annotation option. Furthermore, we'll discuss in more detail how to utilize a general LLM to do the annotation task, along with a step-by-step example.
Without wasting any more time, let’s take a deep breath, make yourselves comfortable, and be ready to learn how to prepare high-quality training data for LLM fine-tuning!
The first option for creating high-quality training data is to use human annotators. In the ideal scenario, well-trained human annotators not only produce high-quality training data but also produce labels that are fully steerable according to the criteria (SOP). However, using humans as annotators is both time- and money-consuming. It is also not scalable, since we need to wait a long time before we can get the labeled data. Finally, the ideal scenario is hard to achieve, since each annotator has their own bias toward specific domains, and label quality often depends on their mood.
A better option is to utilize a general LLM as the annotator. If we do the prompt engineering correctly, an LLM can give us not only high-quality training data but also full steerability according to the criteria. It is also cheaper in terms of both time and money. Finally, it is highly scalable, with no annotator bias included - though hallucination remains a risk.
Let's see how a general LLM is usually utilized as an annotator. We'll use conversation summarization as the example task. The goal is to summarize the given conversation between two users (User A and User B) and return all important information discussed in the conversation in the form of a summarized paragraph.
We start with an initial prompt that we will use to generate the summary of the given conversation - or, in general, the label of the given unlabeled sample.
You are an expert in summarizing the given conversation between two users. Return all important information discussed in the conversation in the form of a summarized paragraph.
Conversation:
{}
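Putting this into code, the labeling loop might look like the sketch below. It assumes the `openai` Python package (v1+ client) with an API key in the environment; the model name and the `build_prompt`/`annotate` helpers are illustrative, not part of any library.

```python
# Sketch of LLM-based annotation for conversation summarization.
SYSTEM_PROMPT = (
    "You are an expert in summarizing the given conversation between two users. "
    "Return all important information discussed in the conversation in the form "
    "of a summarized paragraph."
)

def build_prompt(conversation: str) -> str:
    """Fill the conversation into the annotation prompt template."""
    return f"Conversation:\n{conversation}"

def annotate(conversation: str, model: str = "gpt-4") -> str:
    """Ask a general LLM to generate a summary label for one unlabeled sample."""
    from openai import OpenAI  # assumes the `openai` package (v1+) is installed
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": build_prompt(conversation)},
        ],
    )
    return response.choices[0].message.content
```

Running `annotate` over each unlabeled conversation yields a (conversation, summary) pair ready for the evaluation steps below.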
Using the initial prompt, we evaluate the generated labels on a small number of samples - say, fewer than 20 random samples. We do this manually, eyeballing each labeled sample and judging qualitatively whether it is good enough.
If the output quality on these few samples is good enough, we can move on to the next step. If not, revise the prompt and re-evaluate on another set of fewer than 20 random samples. Repeat this process until you are satisfied with the label quality.
Once we're confident enough in the generated labels, we can further assess their quality with a more quantitative approach and a larger number of samples - say, more than 500. For classification tasks such as sentiment analysis, evaluating label quality is easy: we just compare the generated labels with the ground truth we have, then calculate the precision, recall, or any other classification metric we're interested in.
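As a minimal sketch of that comparison (pure Python, class names illustrative), precision and recall for one class can be computed like this:

```python
def precision_recall(predicted, ground_truth, positive="positive"):
    """Compare LLM-generated labels against ground truth for one class."""
    tp = sum(1 for p, g in zip(predicted, ground_truth) if p == positive and g == positive)
    fp = sum(1 for p, g in zip(predicted, ground_truth) if p == positive and g != positive)
    fn = sum(1 for p, g in zip(predicted, ground_truth) if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example with four samples; in practice both lists would hold 500+ labels.
p, r = precision_recall(
    ["positive", "negative", "positive", "positive"],
    ["positive", "negative", "negative", "positive"],
)
```

Libraries such as scikit-learn provide the same metrics out of the box; the point is only that, with ground truth available, the check reduces to counting agreements per class.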
However, for more complex tasks, such as the one in this example, we need a more sophisticated metric. There are a couple of widely used metrics for the summarization task, such as BLEU and ROUGE. However, these metrics are based on string matching alone, which means that if the generated summary doesn't contain the exact words used in the conversation, the score will suggest that the summary quality is poor. To overcome this, many engineers nowadays utilize GPT-4 to assess label quality. For example, we can write a prompt like the following to assess the quality of the generated labels.
Read the given conversation and summary pair. Rate the quality of the summary using 5 options: “very bad”, “bad”, “moderate”, “good”, “excellent”. Make sure the summary captures all of the important information in the conversation and does not contain any misinformation.
Conversation:
{}
Summary:
{}
Rating:
Once you get the ratings, you can map them to integers - for example, “very bad”: 0, “bad”: 1, “moderate”: 2, and so on. Please make sure that the LLM you use as the evaluator is not from the same family as the LLM you use as the annotator. For example, GPT-3.5 and GPT-4 are in the same family, since they both come from OpenAI.
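A minimal sketch of that mapping, aggregated into a single average score (the integer values for “good” and “excellent” are assumed to continue the same pattern):

```python
# Map the evaluator LLM's ratings to integers and aggregate into one score.
RATING_SCALE = {"very bad": 0, "bad": 1, "moderate": 2, "good": 3, "excellent": 4}

def mean_rating(ratings):
    """Average quality score over the evaluated (conversation, summary) pairs."""
    scores = [RATING_SCALE[r.strip().lower()] for r in ratings]
    return sum(scores) / len(scores)

# Example: four ratings returned by the evaluator LLM.
avg = mean_rating(["good", "excellent", "moderate", "Good"])
```

Normalizing case and whitespace before the lookup helps, since the evaluator LLM may not reproduce the rating strings exactly; in practice you may also want to handle ratings that fall outside the scale.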
If the quantitative metric looks decent and meets the criteria, we can move on to the next step. If not, we can do a subset analysis to see in which kinds of cases the label quality is poor. From there, we can revise the prompt and re-evaluate on the same test data. Repeat this step until you're satisfied with the quantitative metric.
Finally, we can take the best prompt from all of those iterations and apply it to generate labels for the full unlabeled dataset.
Congratulations on making it to this point! Throughout this article, you have learned why LLM fine-tuning is important and when to do it. You have also learned how to prepare high-quality training data for LLM fine-tuning. Best of luck with your LLM fine-tuning experiments, and see you in the next article!
Louis Owen is a data scientist/AI engineer from Indonesia who is always hungry for new knowledge. Throughout his career journey, he has worked in various fields of industry, including NGOs, e-commerce, conversational AI, OTA, Smart City, and FinTech. Outside of work, he loves to spend his time helping data science enthusiasts to become data scientists, either through his articles or through mentoring sessions. He also loves to spend his spare time doing his hobbies: watching movies and conducting side projects.
Currently, Louis is an NLP Research Engineer at Yellow.ai, the world’s leading CX automation platform. Check out Louis’ website to learn more about him! Lastly, if you have any queries or any topics to be discussed, please reach out to Louis via LinkedIn.