Data segregation
Before training a model on the processed data, it is recommended to split the data into two subsets:
- Training data
- Testing data
and sometimes into three:
- Training data
- Validation data
- Testing data
You can then train the model on the training data and later make predictions on the test data. The training set is the only data the model sees while it is being trained. Training produces an inference engine that can later be applied to new data points the model has not seen before. The test set stands in for this unseen data: it is held back during training and used afterwards to check how well the model's predictions hold up on data it was not trained on. When a validation set is used, it is also kept out of training and serves to tune and compare models before the final evaluation on the test set.
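The split described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed implementation: the function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are assumptions chosen for the example, and the "data" is just a list of integers standing in for processed records.

```python
import random


def split_dataset(rows, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle rows and cut them into train, validation and test subsets.

    The fractions and the seed are illustrative defaults (assumptions),
    not values taken from the text.
    """
    if val_fraction < 0 or test_fraction < 0:
        raise ValueError("fractions must be non-negative")
    if val_fraction + test_fraction >= 1:
        raise ValueError("validation and test fractions must sum to less than 1")

    # Copy before shuffling so the caller's data is left untouched.
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    # Each record lands in exactly one subset, so the subsets never overlap.
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical data: 100 integer "records".
data = list(range(100))
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))
```

Setting `val_fraction=0.0` gives the two-way split into training and testing data. Shuffling before cutting is a common default, but it is not always appropriate: for time-ordered data, for example, a chronological cut is usually preferred so that the test set really is later than the training set.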