Handling leakage
A common issue in Kaggle competitions that can affect the outcome of the challenge is data leakage. Data leakage, often referred to simply as leakage or by fancier names (such as golden features), occurs when the training phase involves information that won’t be available at prediction time. Such information will make your model over-perform in both training and testing, allowing you to rank highly in the competition, but from the sponsor’s point of view it will render any solution based on it unusable or, at best, suboptimal.
We can define leakage as “when information concerning the ground truth is artificially and unintentionally introduced within the training feature data, or training metadata,” as stated by Michael Kim (https://www.kaggle.com/mikeskim) in his presentation at Kaggle Days San Francisco in 2019.
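To make the definition more concrete, here is a minimal sketch of one possible sanity check: flagging features whose correlation with the target is suspiciously close to perfect. This is only an illustrative heuristic, not a method taken from the presentation quoted above. The data is synthetic, and the feature names (honest_signal, row_noise, leaky_flag), the helper functions, and the 0.95 threshold are all assumptions made for the example.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant column has no defined correlation; treat it as 0.
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def flag_suspicious(features, target, threshold=0.95):
    """Return names of features whose |correlation| with the target exceeds threshold.

    The threshold is an arbitrary choice for this sketch, not an established rule.
    """
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) > threshold
    ]


random.seed(0)
n = 1000

# A genuine but noisy predictor, and a binary target that depends on it.
signal = [random.gauss(0, 1) for _ in range(n)]
target = [1 if s + random.gauss(0, 1) > 0 else 0 for s in signal]

features = {
    "honest_signal": signal,
    "row_noise": [random.gauss(0, 1) for _ in range(n)],
    # Hypothetical leak: a column derived from the ground truth itself,
    # which therefore could not exist at prediction time.
    "leaky_flag": [t + random.gauss(0, 0.05) for t in target],
}

suspicious = flag_suspicious(features, target)
print(suspicious)
```

A check like this can only catch blunt, near-deterministic leaks. Leakage hidden in row order, identifiers, or metadata usually needs closer inspection of how the data was assembled.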
Leakage is often found in Kaggle competitions, despite careful checking from both the sponsor and...