Avoiding data leakage
Data leakage occurs when a model is trained with information that would not be available at prediction time. Typically, this leads to high performance on the training set but very poor performance on unseen data. There are two types of data leakage:
- Target leakage occurs when information about the target (the quantity we are trying to predict) leaks into some of the model's features. The model then relies too heavily on those features and generalizes poorly. This includes any feature that uses the target in any way.
- Train-test contamination occurs when information leaks between the train and test datasets. This can happen through careless handling and splitting of data, but it can also happen in more subtle ways, such as scaling a dataset before splitting it into train and test sets.
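To make the scaling example concrete, here is a minimal sketch in plain Python. The toy series and the helper names (`fit_scaler`, `transform`) are assumptions made for illustration and are not part of any library. The sketch contrasts computing the scaling statistics on the full dataset, which lets the test rows influence the training features, with computing them on the training rows only.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the (mean, standard deviation) used to standardize `values`."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Standardize `values` with statistics that were computed elsewhere."""
    return [(v - mu) / sigma for v in values]


# Hypothetical toy series; the last two observations play the role of the test set.
series = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, test = series[:4], series[4:]

# Leaky: the statistics are computed on the full series before splitting,
# so the test values shape the scaled training features.
leaky_mu, leaky_sigma = fit_scaler(series)
leaky_train = transform(train, leaky_mu, leaky_sigma)

# Leak-free: the statistics come from the training rows only and are then
# reused, unchanged, to scale the test rows.
mu, sigma = fit_scaler(train)
clean_train = transform(train, mu, sigma)
clean_test = transform(test, mu, sigma)
```

In the leaky version, the mean used to scale the training rows is pulled upward by the two large test values, which is exactly the kind of information the model would not have at prediction time.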
When we work with time series forecasting problems, the biggest and most common mistake that we can make is...