Pre-processing data with pipelines: a more complicated example
If you have ever built a data pipeline, you know that it can get messy when you are working with several different data types. For example, we might need to impute the median for missing values of continuous features and the most frequent value for missing values of categorical features. We might also need to transform our target variable. In this recipe, we explore how to apply different pre-processing to different variables.
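To make the idea concrete, here is a minimal sketch of per-type pre-processing, not the recipe's actual code. It assumes a tiny made-up DataFrame (the column names and values are hypothetical, not taken from the NLS file) and uses scikit-learn's ColumnTransformer to route continuous and categorical columns to different imputers, with TransformedTargetRegressor scaling the target.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and values are illustrative only.
X = pd.DataFrame({
    "weeksworked": [40.0, np.nan, 52.0, 30.0, 48.0, 20.0],
    "gender": ["Female", "Male", np.nan, "Female", "Female", "Male"],
})
y = pd.Series([30000.0, 42000.0, 55000.0, 25000.0, 47000.0, 18000.0])

# Continuous features: impute the median, then standardize.
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Categorical features: impute the most frequent value, then one-hot encode.
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Send each group of columns through its own pipeline.
preprocess = ColumnTransformer([
    ("num", num_pipe, ["weeksworked"]),
    ("cat", cat_pipe, ["gender"]),
])

# Scale the target during fitting; predictions come back on the original scale.
model = TransformedTargetRegressor(
    regressor=make_pipeline(preprocess, LinearRegression()),
    transformer=StandardScaler(),
)
model.fit(X, y)
preds = model.predict(X)
```

Keeping the imputers inside the pipeline means their statistics are learned only from the data passed to fit, which matters once we start splitting data into training and testing sets.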
Getting ready
We will work with a fair number of scikit-learn modules in this recipe. Although this can be confusing at first, you will quickly become grateful that scikit-learn has a tool for pretty much anything you need. Scikit-learn also allows us to add our own transformations to a pipeline when we need to. We demonstrate how to construct our own transformer in this recipe.
We will work with wage and employment data from the National Longitudinal Surveys (NLS).
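As a rough illustration of what adding our own transformation can look like, the sketch below defines a hypothetical OutlierTrimmer class. The class name, its threshold parameter, and the sample values are assumptions made for this example and are not necessarily the transformer built in the recipe. It sets values outside the interquartile-range fences to missing, so that a later imputation step can fill them in.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class OutlierTrimmer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: replace values outside the IQR fences with NaN."""

    def __init__(self, threshold=1.5):
        self.threshold = threshold

    def fit(self, X, y=None):
        # Learn the per-column fences from the data passed to fit.
        X = np.asarray(X, dtype=float)
        q1, q3 = np.nanpercentile(X, [25, 75], axis=0)
        iqr = q3 - q1
        self.lower_ = q1 - self.threshold * iqr
        self.upper_ = q3 + self.threshold * iqr
        return self

    def transform(self, X):
        # Work on a copy so the caller's data is left unchanged.
        X = np.asarray(X, dtype=float).copy()
        mask = (X < self.lower_) | (X > self.upper_)
        X[mask] = np.nan
        return X


# Made-up values; the last one is far above the upper fence.
X = pd.DataFrame(
    {"wageincome": [20000.0, 22000.0, 25000.0, 27000.0, 30000.0, 500000.0]}
)

# The custom step slots into a pipeline like any built-in transformer.
pipe = make_pipeline(OutlierTrimmer(), SimpleImputer(strategy="median"))
Xt = pipe.fit_transform(X)
```

Inheriting from TransformerMixin supplies fit_transform, and BaseEstimator supplies get_params and set_params, so the class only needs to define fit and transform.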
How to do it...
- We start by loading the libraries...