Splitting the dataset
Through the data preparation process, we now have a dataset that is ready for model development. To help detect and guard against model underfitting and overfitting, it is a best practice to split the dataset randomly yet proportionally into independent subsets that correspond to the stages of model development: a training dataset, a validation dataset, and a testing dataset:
- Training dataset: The subset of data used to train the model. The model will learn from the training dataset.
- Validation dataset: The subset of data used to validate the trained model. Model hyperparameters are tuned based on the model's performance on this subset.
- Testing dataset: The subset of data used to evaluate the final model before its deployment to production. It is held out from both training and tuning.
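The three subsets above can be produced with a seeded random shuffle followed by slicing. The following is a minimal sketch, not a definitive implementation: the function name `split_dataset`, its parameters, and the 80/10/10 default ratio are illustrative assumptions, and the sketch splits purely at random, so it does not preserve class proportions the way a stratified split would.

```python
import random


def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle rows and split them into train, validation, and test lists.

    Hypothetical helper for illustration; the fractions are assumptions.
    """
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for testing")

    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains goes to testing
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Fixing the seed makes the split repeatable across runs, which keeps validation and test results comparable between experiments. For classification data with imbalanced classes, a stratified split is usually the safer choice.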
A common practice is to use 80 percent of the data for the training subset, 10 percent for validation, and 10 percent for testing. When you have a large amount of data, you can split it into 70 percent training, 15 percent validation...