Cross-validation and model selection
In the previous example, we validated our approach by withholding 30% of the data from training and testing on this subset. This approach is not particularly rigorous: the exact result changes depending on the random train-test split. Furthermore, if we wanted to test several different hyperparameters (or different models) to choose the best one, we would, unwittingly, choose the model that best reflects the specific rows in our test set, rather than the population as a whole.
This can be overcome with cross-validation. We have already encountered cross-validation in Chapter 4, Parallel Collections and Futures. In that chapter, we used random subsample cross-validation, where we created the train-test split randomly.
In this chapter, we will use k-fold cross-validation: we split the training set into k parts (where, typically, k is 10 or 3) and use k-1 parts as the training set and the remaining part as the test set. The train/test cycle is repeated k times, keeping a different part as the test set in each iteration, so that every observation is used for testing exactly once.
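To make the mechanics concrete, here is a minimal sketch of how k-fold splits can be generated over an in-memory collection in plain Scala. The names (`kFoldSplits`, `KFoldExample`) are hypothetical and are not part of any library API used in this chapter; the sketch only illustrates how each observation ends up in exactly one test fold.

```scala
import scala.util.Random

object KFoldExample extends App {

  /** Split `data` into k (train, test) pairs. Each observation is
    * assigned to exactly one fold and appears in exactly one test set. */
  def kFoldSplits[A](data: Vector[A], k: Int, seed: Long = 42L)
      : Seq[(Vector[A], Vector[A])] = {
    // Shuffle once so fold membership is random but reproducible.
    val shuffled = new Random(seed).shuffle(data)
    // Assign each observation a fold index in 0 until k.
    val folds: Map[Int, Vector[A]] =
      shuffled.zipWithIndex
        .groupBy { case (_, i) => i % k }
        .map { case (fold, pairs) => fold -> pairs.map(_._1) }
    // For each fold, use that fold as the test set and the rest as training.
    (0 until k).map { fold =>
      val test  = folds(fold)
      val train = (0 until k).filterNot(_ == fold).flatMap(folds).toVector
      (train, test)
    }
  }

  // Toy example: 9 observations, 3 folds.
  val data = Vector.tabulate(9)(i => s"obs$i")
  kFoldSplits(data, k = 3).zipWithIndex.foreach { case ((train, test), i) =>
    println(s"Fold $i: ${train.size} training observations, ${test.size} test observations")
  }
}
```

In practice, we would train the model on each training portion, evaluate it on the corresponding test fold, and summarize the k evaluation scores (for instance, by averaging them) to compare hyperparameter settings.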