Data scientists need to assess a model's performance, guard against overfitting, and tune its hyperparameters. All of these tasks require held-out data records that were not used to fit the model. Before model development, the data therefore needs to be divided into parts, such as training, test, and validation sets. The training set is used to build the model. The test set is used to assess the performance of the model that was trained on the training set. The validation set is used to tune the hyperparameters. Let's look at the following strategies for splitting the data in the upcoming subsections:
- Holdout method
- K-fold cross-validation
- Bootstrap method
Holdout
In this method, the dataset is divided randomly into two parts: a training set and a testing set. A common ratio is 2:1, which means 2/3 of the records for training and 1/3 for testing. Other ratios, such as 6:4, 7:3, and 8:2, can also be used:
# partition data into training...
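To make the holdout idea concrete, here is a minimal sketch that uses only the Python standard library. The helper name `holdout_split`, its parameters, and the toy `data` list are assumptions made for illustration and are not taken from the original listing; in practice, scikit-learn's `train_test_split` function does the same job.

```python
import random

def holdout_split(records, test_fraction=1/3, seed=42):
    """Shuffle the records and return (train, test) lists.

    Hypothetical helper for illustration; test_fraction=1/3
    corresponds to the 2:1 ratio described in the text.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(30))                     # toy stand-in for real records
train, test = holdout_split(data)
print(len(train), len(test))               # 20 10
```

Fixing the seed keeps the split reproducible between runs. Passing `test_fraction=0.3` or `test_fraction=0.2` would give the 7:3 and 8:2 ratios mentioned above.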