Creating training datasets and avoiding data leakage
One of the biggest threats to the performance of our models is data leakage. Data leakage occurs whenever information from outside the training dataset informs the model during training. When we inadvertently assist model training with information that cannot be gleaned from the training data alone, we end up with an overly rosy assessment of our model’s accuracy.
Data scientists do not intend for this to happen, hence the term “leakage.” This is not a “don’t do it” kind of discussion; we all know not to do it. It is more of a “which steps should I take to avoid the problem?” discussion. Data leakage is surprisingly easy to introduce unless we develop routines to prevent it.
For example, if a feature has missing values, we might impute the mean of that feature computed across the whole dataset. However, in order to validate our model, we subsequently split the data into training and test sets; because the imputed mean was calculated using rows that end up in the test set, information about the test data has leaked into the training data.
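The imputation pitfall above can be sketched in a few lines. This is a minimal illustration using plain NumPy on synthetic data (the array names and the 80/20 split are arbitrary choices, not from the text): the leaky version imputes with the mean of the entire dataset before splitting, while the correct version splits first and imputes both partitions using statistics computed from the training rows only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50.0, 10.0, size=100)
x[::10] = np.nan  # plant some missing values

# LEAKY: impute with the mean of the *entire* dataset, then split.
# The test rows contributed to the value written into the training rows.
leaky = x.copy()
leaky[np.isnan(leaky)] = np.nanmean(x)
leaky_train, leaky_test = leaky[:80], leaky[80:]

# SAFE: split first, then impute using the training partition only.
train, test = x[:80].copy(), x[80:].copy()
train_mean = np.nanmean(train)          # statistic from training data alone
train[np.isnan(train)] = train_mean
test[np.isnan(test)] = train_mean       # test set sees only the train statistic
```

In practice the same discipline is what scikit-learn’s `Pipeline` enforces automatically: a `SimpleImputer` placed inside a pipeline is fit only on the training folds during cross-validation, so the test folds never influence the imputed values.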