Preprocessing data with pipelines
We only scratched the surface of what we can do with scikit-learn pipelines in the previous section. We often need to fold all of our preprocessing and feature engineering into a pipeline, including scaling, encoding, and handling outliers and missing values. This can be complicated because different features may need to be handled differently. For example, we may need to impute the median for missing values in numeric features and the most frequent value in categorical features. We may also need to transform our target variable. We will explore how to do that in this section.
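To make the idea of per-feature handling concrete, here is a minimal sketch of median imputation for numeric columns and most-frequent imputation for categorical columns, combined with ColumnTransformer. The toy DataFrame, its column names, and the choice of StandardScaler and OneHotEncoder are assumptions made for illustration; they are not the data or the exact steps used in this chapter.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in every column
X = pd.DataFrame({
    "income": [40.0, np.nan, 65.0, 52.0],
    "age": [25.0, 38.0, np.nan, 51.0],
    "region": ["north", "south", np.nan, "south"],
})

numeric_cols = ["income", "age"]
categorical_cols = ["region"]

# Numeric features: fill gaps with the median, then standardize
numeric_steps = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
)

# Categorical features: fill gaps with the most frequent value, then one-hot encode
categorical_steps = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Route each group of columns to its own sub-pipeline;
# sparse_threshold=0 asks for a dense array as output
preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ],
    sparse_threshold=0,
)

X_prepared = preprocessor.fit_transform(X)
print(X_prepared.shape)
```

Because the imputers are fitted inside the ColumnTransformer, the medians and most frequent values are learned only from the data passed to fit, which helps avoid leaking information from testing data when the transformer is part of a larger pipeline.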
Follow these steps:
- We will start by loading the libraries we have already worked with in this chapter. Then, we will add the ColumnTransformer and TransformedTargetRegressor classes. We will use those classes to transform our features and target, respectively:
  import pandas as pd
  import numpy as np
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import...
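As a rough illustration of the target side, the sketch below wraps a regressor in TransformedTargetRegressor so that the target is scaled before fitting and predictions are converted back to the original units. The synthetic data and the choice of LinearRegression and StandardScaler are assumptions for this example only; the chapter's own data and estimator may differ.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: an exactly linear target, y = 3x + 100
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 100.0

# Feature transformations live in the inner pipeline
feature_pipeline = make_pipeline(StandardScaler(), LinearRegression())

# The wrapper scales y before fitting and inverts the scaling at predict time
model = TransformedTargetRegressor(
    regressor=feature_pipeline,
    transformer=StandardScaler(),
)

model.fit(X, y)
predictions = model.predict(X)  # returned in the original units of y
print(predictions[:3])
```

Keeping the target transformation inside the estimator means we do not have to remember to apply the inverse transformation by hand when we evaluate predictions.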