In the previous recipes, we showed all the steps required to build a machine learning model —starting with loading data, splitting it into a training and a test set, imputing missing values, encoding categorical features, and—lastly—fitting a decision tree classifier.
The process requires multiple steps to be executed in a certain order, which can sometimes be tricky with a lot of modifications to the pipeline mid-work. That is why scikit-learn introduced Pipelines. By using Pipelines, we can sequentially apply a list of transformations to the data, and then train a given estimator (model).
One important point to be aware of is that the intermediate steps of the Pipeline must have the fit and transform methods (the final estimator only needs the fit method, though). Using Pipelines has several benefits:
- The flow...