Data leakage
Data leakage, in the context of GenAI, refers to situations where information from outside the intended training dataset is used to create the model, leading to overly optimistic performance metrics and potentially flawed or misleading predictions. It can occur at any stage of model development, from data collection through model evaluation, and can significantly compromise the validity of the AI system. Several types of datasets serve distinct purposes:
- Training datasets, which are used to train the LLM
- Fine-tuning datasets, which can be used to improve LLM responses and reduce hallucinations
- Evaluation datasets, which can be useful in evaluating the accuracy of responses
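One common form of leakage is overlap between these datasets: if evaluation examples also appear in the training data, accuracy metrics are inflated. Below is a minimal sketch of one safeguard, an exact-match overlap check between training and evaluation texts. The function name and normalization scheme are illustrative assumptions, not part of any specific library; real pipelines often use fuzzier checks such as n-gram overlap.

```python
def find_leaked_examples(train_texts, eval_texts):
    """Return eval examples that also appear verbatim in the training set.

    Note: normalization here (strip + lowercase) is a simplifying
    assumption; production deduplication typically also compares
    n-grams or embeddings to catch near-duplicates.
    """
    train_set = {t.strip().lower() for t in train_texts}
    return [t for t in eval_texts if t.strip().lower() in train_set]


train = ["The capital of France is Paris.", "Water boils at 100 C."]
evaluation = ["Water boils at 100 C.", "The moon orbits the Earth."]
print(find_leaked_examples(train, evaluation))
# The first evaluation example is flagged as leaked from training data.
```

Running a check like this before evaluation helps ensure reported metrics reflect performance on genuinely unseen data.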
Causes of data leakage
The causes of data leakage are well understood and largely avoidable, provided the developers of these applications are aware of them. First, let's look at a high level at what leads to data leakage:
- Inappropriate...