One of the most common ways to evaluate how well our models have learned is to test an algorithm's predictions on data it has never seen before. However, fresh data is not always available to feed into our models. One alternative is to subdivide the data at our disposal into training and testing subsets, choosing the percentage of data assigned to each. Typically, 70% to 80% of the data goes to the training subset, with the remaining 20–30% assigned to the testing subset.
The subdivision of the original sample dataset into two subsets for training and testing can be easily performed using the scikit-learn library, as we have done several times in our examples:
from sklearn.model_selection import train_test_split
# Create training and testing subsets
X_train, X_test...
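As a self-contained illustration of this kind of split, the following minimal sketch applies `train_test_split` to a small synthetic dataset. The arrays `X` and `y`, the 75/25 ratio, and the `random_state` value are assumptions made for this example and do not come from the original listing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data (not from the original example):
# 100 samples with 4 features each, and alternating binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0, 1] * 50)

# Hold out 25% of the samples for testing; the remaining 75% are used for training.
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```

Changing `test_size` to any value between 0.2 and 0.3 gives the other splits in the range discussed above.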