Examples involving the scikit-learn preprocessing module
For both imputation and standardization, scikit-learn offers a similar API:
- First, fit the imputer or scaler on the data to learn the required statistics.
- Then, use the fitted object to transform new data.
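The two steps above can be sketched with a toy example (the data here is purely illustrative, not from the chapter's dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_new = np.array([[2.0, 200.0]])

scaler = StandardScaler()
scaler.fit(X_train)                 # step 1: learn each column's mean and std
X_scaled = scaler.transform(X_new)  # step 2: apply the learned transformation
# X_new equals the column means, so X_scaled is [[0.0, 0.0]]
```

The same two-step pattern applies to every transformer in this section.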
In this section, I will demonstrate two examples, one for imputation and another for standardization.
Note
Scikit-learn uses the same fit and predict syntax for its predictive models. This consistent interface is very good API design. We will cover the machine learning methods in later chapters.
Imputation
First, create an imputer from the SimpleImputer class. At initialization, the missing_values argument lets you specify how missing values are represented, which is handy: our original data marked them with a question mark, and after converting those entries to np.nan we can feed the data in directly:
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
Note that fit and transform can accept the same input:
imputer.fit(df2)
df3 = pd.DataFrame(imputer.transform(df2))
Now, check the number of missing values – the result should be 0:
np.sum(np.sum(np.isnan(df3)))
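As a quick sanity check on what mean imputation actually does, here is a small self-contained example (the toy DataFrame is illustrative, not the chapter's dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
# Each missing entry is replaced by its column's mean:
# column a -> 2.0, column b -> 4.5, and no NaNs remain
```

Here fit_transform combines the two steps into one call, which is convenient when the fitted imputer does not need to be reused on other data.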
Standardization
Standardization can be implemented in a similar fashion:
from sklearn import preprocessing
The scale function provides the default zero-mean, one-standard deviation transformation:
df4 = pd.DataFrame(preprocessing.scale(df2))
Note
In this example, categorical variables encoded as integers are also rescaled to zero mean, which destroys their categorical meaning and should be avoided in production.
Let's check the mean and standard deviation. The following line outputs values very close to 0, differing only by floating-point error:
df4.mean(axis=0)
The following line outputs values close to 1 (not exactly 1, because scale divides by the population standard deviation while pandas' std computes the sample standard deviation):
df4.std(axis=0)
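Both observations can be verified numerically on a toy column (the data below is illustrative):

```python
import pandas as pd
from sklearn import preprocessing

X = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
scaled = pd.DataFrame(preprocessing.scale(X))

# scale() centers exactly, so the mean is 0 up to floating-point error.
# It divides by the population standard deviation (ddof=0), while pandas'
# .std() uses the sample version (ddof=1), so .std() of the result is
# sqrt(n / (n - 1)) rather than exactly 1 -- here sqrt(4/3).
```

For a transformation that can be refit once and reapplied to later batches, StandardScaler offers the same behavior through the fit/transform interface.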
Let's look at an example of MinMaxScaler, which transforms every variable into the range [0, 1]. The following code fits and transforms the heart disease dataset in one step; verifying the result is left to you:
minMaxScaler = preprocessing.MinMaxScaler()
df5 = pd.DataFrame(minMaxScaler.fit_transform(df2))
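One way to examine its validity is to check the formula (x - min) / (max - min) on a toy column (the data below is illustrative):

```python
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({"x": [2.0, 4.0, 10.0]})
scaled = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(df))
# (x - 2) / (10 - 2): 2 -> 0.0, 4 -> 0.25, 10 -> 1.0
```

After scaling, the minimum of every column is 0 and the maximum is 1, which you can confirm on df5 with df5.min(axis=0) and df5.max(axis=0).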
Let's now summarize what we have learned in this chapter.