Now that we have our response variable, the next step is to split the dataset into training and testing sets. In data science, the training set is the data that is used to determine the model coefficients. In the training phase, the model takes into account the predictor variable values together with the response value to "discover" the rules and the weights that will guide the prediction of new data. The testing set is then used to measure our model's performance on data it did not see during training, as we discussed in Chapter 3, Machine Learning Foundations. Typical splits use 70-80% for the training data and 20-30% for the testing data (unless the dataset is very large, in which case a smaller percentage can be allotted toward the testing set).
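As a rough illustration of the split described above, here is a minimal sketch in plain Python. The function name `train_test_split_rows`, the 25% test fraction, the seed, and the toy data are all assumptions made for this example and are not taken from the chapter. It shuffles the row indices and sets aside a fixed fraction of rows for testing.

```python
import random


def train_test_split_rows(rows, test_fraction=0.25, seed=42):
    """Randomly split rows into (train, test) lists.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")

    # Shuffle the indices with a seeded generator so the split is reproducible.
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)

    # The first n_test shuffled indices form the testing set.
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])

    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Made-up toy data: (predictor, response) pairs.
rows = [(x, 2 * x + 1) for x in range(20)]
train, test = train_test_split_rows(rows, test_fraction=0.25)
print(len(train), len(test))  # 15 5
```

In practice, a library helper such as scikit-learn's `train_test_split` (with its `test_size` and `random_state` arguments) does the same job; the version above only makes the mechanics visible.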
Some practitioners also hold out a validation set that is used to tune model hyperparameters, such as the tree size in the random...