When a dataset is large enough, it's good practice to split it into training and test sets, the former to be used for training the model and the latter to evaluate its performance. The following diagram shows a schematic representation of this process:
Training/test set split process schema
There are two main rules to follow when performing this operation:
- Both datasets must reflect the original distribution
- The original dataset must be randomly shuffled before the split in order to avoid a correlation between consecutive elements (as illustrated in the sketch after this list)
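As an illustration only, here is a minimal sketch of how such a shuffle-and-split could be done by hand with NumPy; the arrays X and Y below are hypothetical placeholder data, and in practice the train_test_split() function shown afterwards takes care of both steps:

import numpy as np

# Hypothetical placeholder data: 100 samples with 3 features and a binary label
X = np.random.uniform(0.0, 1.0, size=(100, 3))
Y = np.random.choice([0, 1], size=100)

# Shuffle the indices so that consecutive samples are no longer correlated
rng = np.random.default_rng(1000)
indices = rng.permutation(len(X))

# Hold out the last 25% of the shuffled samples as the test set
split_point = int(len(X) * 0.75)
train_idx, test_idx = indices[:split_point], indices[split_point:]
X_train, Y_train = X[train_idx], Y[train_idx]
X_test, Y_test = X[test_idx], Y[test_idx]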
With scikit-learn, this can be achieved by using the train_test_split() function:
from sklearn.model_selection import train_test_split

# Hold out 25% of the samples for testing; random_state makes the shuffle reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1000)
The test_size parameter (as well as train_size) allows you to specify the proportion of elements to put into the test (or training) set, while random_state fixes the seed of the shuffle so that the split is reproducible.
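As a brief sketch, assuming the same X and Y as above, the split can equivalently be expressed through train_size, and for classification tasks the stratify parameter can be used to keep the class proportions of the original dataset in both subsets (rule 1 above):

from sklearn.model_selection import train_test_split

# Equivalent call expressed through train_size instead of test_size
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.75, random_state=1000)

# stratify=Y preserves the original class distribution in both subsets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1000, stratify=Y)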