Having completed the EDA, the next step is to split the dataset into training and test sets. The idea is to have two separate datasets:
- Training set: the part of the data on which we train a machine learning model
- Test set: the part of the data that the model does not see during training, used to evaluate its performance
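To make the idea concrete, here is a minimal sketch of a random split using only the Python standard library. The helper name `split_train_test`, the 80/20 ratio, the seed, and the toy `rows` list are all assumptions made for illustration; they do not come from the dataset analyzed in the text.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists.

    Hypothetical helper for illustration only.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                   # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows form the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for the rows of a real dataset
train, test = split_train_test(rows)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, so the model never trains on data later used for evaluation. In practice, a library routine such as scikit-learn's `train_test_split` is the more common choice.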
The goal of splitting the data is to guard against overfitting by making it detectable. Overfitting is a phenomenon whereby a model learns patterns that are specific to the training data, including noise, and therefore performs well only on that particular data. In other words, it fails to generalize to unseen data.
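The effect of overfitting can be illustrated with a small, self-contained sketch. It assumes a synthetic one-feature dataset with 20% of the labels flipped at random, and compares a 1-nearest-neighbor classifier, which effectively memorizes the training set, with a simple threshold rule. The data, both models, and all names here are illustrative assumptions, not the models used in the text.

```python
import random

rng = random.Random(0)


def make_sample():
    """Return (x, label): label is 1 when x > 0.5, flipped 20% of the time."""
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.2:  # simulate label noise
        label = 1 - label
    return x, label


data = [make_sample() for _ in range(500)]
train, test = data[:400], data[400:]  # samples are already in random order


def predict_1nn(x):
    # Label of the closest training point: this memorizes the training set
    return min(train, key=lambda point: abs(point[0] - x))[1]


def predict_threshold(x):
    # Simple rule that ignores the noise in individual training labels
    return int(x > 0.5)


def accuracy(samples, predict):
    return sum(predict(x) == y for x, y in samples) / len(samples)


train_acc = accuracy(train, predict_1nn)
test_acc = accuracy(test, predict_1nn)
print(f"1-NN       train={train_acc:.2f} test={test_acc:.2f}")
print(f"threshold  train={accuracy(train, predict_threshold):.2f} "
      f"test={accuracy(test, predict_threshold):.2f}")
```

The memorizing model scores perfectly on the data it was trained on, because each training point is its own nearest neighbor, while its score on the held-out samples is lower. The threshold rule typically scores about the same on both sets. Without a separate test set, the gap between the two models would be invisible.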
This is a very important step in the analysis, as doing it incorrectly can introduce bias, for example, in the form of data leakage. Data leakage can occur when, during the training phase, a model observes information to which it should...