Splitting data into training and test sets
With the EDA complete, the next step is to split the dataset into training and test sets. The idea is to have two separate datasets:
- Training set: the part of the data on which we train a machine learning model,
- Test set: the part of the data that the model did not see during training and that is used to evaluate its performance.
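To make the two sets concrete, the following is a minimal sketch of a random split using only the Python standard library. The helper name `split_train_test`, the 80/20 ratio, and the seed are illustrative assumptions, not values taken from the text; in practice, a library function such as scikit-learn's `train_test_split` is typically used for this step.

```python
import random


def split_train_test(rows, test_size=0.2, seed=42):
    """Shuffle rows and split them into disjoint training and test lists.

    Hypothetical helper for illustration only.
    """
    if not 0 < test_size < 1:
        raise ValueError("test_size must be strictly between 0 and 1")
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_size))
    # The first n_test shuffled rows form the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for the observations of a dataset
train, test = split_train_test(rows, test_size=0.2)
print(len(train), len(test))
```

Every observation ends up in exactly one of the two sets, so the model never sees the test observations during training.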
By splitting the data this way, we want to guard against overfitting: evaluating the model on data it has not seen reveals whether it has overfit. Overfitting occurs when a model learns patterns that are specific to the training data, including noise, and therefore performs well only on that particular data. In other words, it fails to generalize to unseen data.
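The gap between training and test performance can be illustrated with a toy example. The sketch below assumes a synthetic one-dimensional dataset with 30% of the labels flipped and a 1-nearest-neighbor classifier, which simply memorizes the training data; all names and values are hypothetical and chosen only for illustration.

```python
import random

rng = random.Random(0)


def make_point():
    """Draw x from [-1, 1]; the true label is 1 when x > 0, with 30% noise."""
    x = rng.uniform(-1, 1)
    label = int(x > 0)
    if rng.random() < 0.3:  # flip the label to simulate noise
        label = 1 - label
    return x, label


train = [make_point() for _ in range(200)]
test = [make_point() for _ in range(200)]


def predict_1nn(x, memory):
    """Return the label of the memorized point closest to x."""
    return min(memory, key=lambda point: abs(point[0] - x))[1]


def accuracy(data, memory):
    """Share of points in data whose label is predicted correctly."""
    return sum(predict_1nn(x, memory) == y for x, y in data) / len(data)


train_acc = accuracy(train, train)  # each point is its own nearest neighbor
test_acc = accuracy(test, train)    # unseen points expose the memorized noise
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because the classifier memorizes the noisy labels, it scores perfectly on the training set while doing noticeably worse on the test set. Without a held-out test set, this failure to generalize would go unnoticed.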
This is a very important step in the analysis, as doing it incorrectly can introduce bias, for example, in the form of data leakage. Data leakage occurs when, during the training phase, a model observes information to which it should not have access. A common scenario is that of imputing missing values with the feature...