Creating training datasets and avoiding data leakage
One of the biggest threats to the performance of our models is data leakage. Data leakage occurs whenever information from outside the training dataset informs our models. When we inadvertently assist model training with information that cannot be gleaned from the training data alone, we end up with an overly rosy assessment of our model's accuracy.
Data scientists do not really intend for this to happen, hence the term leakage. This is not a "don't do it" kind of discussion; we all know not to do it. It is more of a "which steps should I take to avoid the problem?" discussion. It is actually quite easy to introduce some data leakage unless we develop routines to prevent it.
For example, if we have missing values for a feature, we might replace them with the mean of that feature computed across the whole dataset. However, in order to validate our model, we subsequently split our data into training and testing datasets. We would...
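To make the imputation example concrete, here is a minimal sketch of the safer workflow: split first, then fit the imputer on the training fold only, so the test rows never influence the imputed value. The data and variable names are illustrative, and the example assumes scikit-learn's SimpleImputer and train_test_split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Hypothetical feature with missing values (values are illustrative)
X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan]])
y = np.array([0, 0, 1, 1, 1, 0])

# Leaky approach: impute the mean over the FULL dataset, then split.
# The imputed value now carries information from rows that will end up
# in the test set.
leaky_mean = np.nanmean(X)

# Safer approach: split first, then learn the mean from the training fold only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)                       # mean computed from training rows only
X_train_clean = imputer.transform(X_train)
X_test_clean = imputer.transform(X_test)   # reuse the training mean on test rows
```

The key point is that the imputer is fit on X_train alone; the same fitted imputer is then applied to the test data, mimicking how the model would handle genuinely unseen observations.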