Pre-processing data with pipelines: a simple example
When doing predictive analysis, we often need to fold all of our pre-processing and feature engineering into a pipeline, including scaling, encoding, and handling outliers and missing values. We discussed the reasons why we might need to incorporate all of our data preparation into a data pipeline in Chapter 8, Encoding, Transforming, and Scaling Features. The main takeaway from that chapter is that pipelines are critical when we are building predictive models and need to avoid data leakage. This is trickier still when we use k-fold cross-validation for model validation, since the testing and training DataFrames change with each iteration of the evaluation. Cross-validation has become the norm when constructing predictive models. A sketch of such a pipeline follows.
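As a minimal sketch of this approach (assuming scikit-learn and hypothetical column names such as age, income, and region), we can bundle imputation, scaling, and encoding into a single pipeline, so that every preprocessing statistic is learned from the training data alone:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical feature lists; substitute your own columns
num_cols = ["age", "income"]
cat_cols = ["region"]

# numeric features: impute missing values, then scale
num_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# categorical features: impute, then one-hot encode
cat_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preproc = ColumnTransformer([
    ("num", num_steps, num_cols),
    ("cat", cat_steps, cat_cols),
])

# the full pipeline: preprocessing is fit only on training data,
# so statistics such as medians and means never leak from test data
pipe = Pipeline([
    ("preproc", preproc),
    ("model", LinearRegression()),
])

Because Pipeline and ColumnTransformer defer all fitting until fit is called, the entire chain of transformations can be refit on each training split during cross-validation, which is exactly what prevents leakage.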
Note
k-fold cross-validation trains our model on all but one of the k folds, or parts, leaving one out for testing. This is repeated k times, each time excluding a different fold for testing. Performance is then measured on each held-out fold and averaged across the k iterations.
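To illustrate (again a sketch, assuming the pipe object built above and feature and target objects X and y that are not defined here), scikit-learn's cross_val_score refits the whole pipeline on each set of training folds and scores it on the held-out fold:

from sklearn.model_selection import KFold, cross_val_score

# 5 folds: each fold serves as the test set exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=kf, scoring="r2")

# overall performance is the average of the k per-fold scores
print(scores.mean())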