In the previous chapter, we built a model with certain assumptions and settings and measured its performance with the accuracy metric (the overall proportion of correctly classified labels). To do this, we split our data randomly into training and testing sets. While that approach is fundamental, it has drawbacks. Most importantly, by repeatedly tuning the model to improve its score on one fixed test set, we may end up with a model that looks better on that particular dataset but actually performs worse on unseen data. This phenomenon is called overfitting.
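The following is a minimal sketch of that single-split evaluation, assuming scikit-learn; the dataset and the logistic regression model are illustrative choices, not necessarily the ones used in the previous chapter:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# One random split: the model never sees the test rows during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Accuracy: the proportion of correctly classified labels on the held-out set.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Because this score comes from one particular random split, it can shift noticeably if we split the data differently, which is exactly the weakness cross-validation addresses.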
To combat this issue, we'll use a slightly more complex approach: cross-validation. In its basic form, cross-validation splits the data into multiple so-called folds (subsets of roughly equal size). The folds can additionally be stratified by the target variable, so that each one preserves the overall class proportions...
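As a rough sketch of the idea, assuming scikit-learn, stratified k-fold cross-validation could look like this; `StratifiedKFold` keeps the class proportions roughly equal across folds, and `cross_val_score` fits and scores the model once per fold:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Five folds, shuffled and stratified by the target variable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# One accuracy value per fold, plus their mean as a more stable estimate.
print("Per-fold accuracy:", scores.round(3))
print("Mean accuracy:    ", scores.mean().round(3))
```

Because every observation is used for testing exactly once, the mean of the per-fold scores is a less optimistic and less split-dependent estimate than a single train/test accuracy.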