Preparing qualitative data
In this recipe, we will prepare qualitative data, including missing value imputation and encoding.
Getting ready
Qualitative data requires different treatment from quantitative data. Imputing missing values with the mean of a feature would make no sense (and would not even work with non-numeric data): it makes more sense, for example, to use the most frequent value (the mode) of a feature. The SimpleImputer class allows us to do exactly that.
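As a quick illustration, here is a minimal toy sketch (not the recipe's data) of what most_frequent imputation does to a small categorical column:
# Toy sketch: the missing entry is replaced by the column's most frequent value
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
toy = pd.DataFrame({'Embarked': ['S', 'S', np.nan, 'C']})
imputer = SimpleImputer(strategy='most_frequent')
print(imputer.fit_transform(toy))  # the missing entry becomes 'S', the mode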
The same goes for rescaling: it would make no sense to rescale qualitative data. Instead, it is more common to encode it. One of the most typical techniques is called one-hot encoding.
The idea is to transform each category, out of N possible categories, into a vector holding a single 1 and N-1 zeros. In our example, the Embarked feature’s one-hot encoding would be as follows:
- ‘C’ = [1, 0, 0]
- ‘Q’ = [0, 1, 0]
- ‘S’ = [0, 0, 1]
Note
Having N columns for N categories is not necessarily optimal. What happens if, in the preceding example, we remove the first column? If the value is neither ‘Q’ = [1, 0] nor ‘S’ = [0, 1], then it must be ‘C’ = [0, 0]. There is no need for an extra column to keep all the necessary information. This generalizes: N categories only require N-1 columns to carry all the information, which is why one-hot encoding functions usually allow you to drop a column.
scikit-learn’s OneHotEncoder class allows us to do this. It also allows us to deal with unknown categories that may appear in the test set (or in production) through several strategies, such as raising an error, ignoring them, or grouping them as an infrequent category. Finally, it allows us to drop the first column after encoding.
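As a quick sketch (not the recipe's code, and assuming scikit-learn 1.0 or later, since older releases do not allow combining drop with handle_unknown='ignore'), here is how the Embarked example behaves once the first category is dropped and an unseen value shows up:
# Toy sketch: drop='first' keeps N-1 columns; unseen values are encoded as all zeros
import numpy as np
from sklearn.preprocessing import OneHotEncoder
embarked = np.array([['C'], ['Q'], ['S']])
enc = OneHotEncoder(drop='first', handle_unknown='ignore')
print(enc.fit_transform(embarked).toarray())
# [[0. 0.]   -> 'C', the dropped first category
#  [1. 0.]   -> 'Q'
#  [0. 1.]]  -> 'S'
print(enc.transform(np.array([['X']])).toarray())  # unseen 'X' -> [[0. 0.]]
Note that with drop enabled, an unseen category ends up encoded exactly like the dropped one, which is something to keep in mind.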
How to do it…
Just like in the preceding recipe, we will handle any missing data; the features will then be one-hot encoded:
- Import the necessary classes – SimpleImputer for missing data imputation (already imported in the previous recipe) and OneHotEncoder for encoding. We also need to import numpy so that we can concatenate the prepared qualitative and quantitative data at the end of this recipe:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
- Select the qualitative features we want to keep: 'Sex' and 'Embarked'. Then, store these features in new variables for both the train and test sets:
quali_columns = ['Sex', 'Embarked']
# Get the qualitative columns
X_train_quali = X_train[quali_columns]
X_test_quali = X_test[quali_columns]
- Instantiate SimpleImputer with the most_frequent strategy. Any missing values will be replaced by the most frequent value of the feature:
# Impute missing qualitative values with the most frequent feature value
quali_imputer = SimpleImputer(strategy='most_frequent')
- Fit and transform the imputer on the train set, and then transform the test set:
# Fit and impute the training set
X_train_quali = quali_imputer.fit_transform(X_train_quali)
# Just impute the test set
X_test_quali = quali_imputer.transform(X_test_quali)
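If you want to double-check what was learned (an optional check, not part of the original recipe), the fitted imputer exposes the most frequent value of each column in its statistics_ attribute:
# Optional check: the per-column modes learned by the imputer
print(dict(zip(quali_columns, quali_imputer.statistics_)))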
- Instantiate the encoder. Here, we will specify the following parameters:
  - drop='first': This will drop the first category's column for each feature
  - handle_unknown='ignore': If a new value appears in the test set (or in production), it will be encoded as all zeros:
# Instantiate the encoder
encoder = OneHotEncoder(drop='first', handle_unknown='ignore')
- Fit and transform the encoder on the training set, and then transform the test set using this encoder:
# Fit and transform the training set
X_train_quali = encoder.fit_transform(X_train_quali).toarray()
# Just encode the test set
X_test_quali = encoder.transform(X_test_quali).toarray()
Note
We need to call .toarray() on the encoder's output because OneHotEncoder returns a sparse matrix by default, which cannot be concatenated in that form with the other features.
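Optionally (and assuming scikit-learn 1.0 or later), you can inspect the names of the encoded columns, which can help when debugging; this check is not part of the original recipe:
# Optional check: names of the encoded columns, with each feature's first category dropped
print(encoder.get_feature_names_out(quali_columns))
# e.g. ['Sex_male' 'Embarked_Q' 'Embarked_S'] for the Titanic data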
- With that, all the data has been prepared – both quantitative and qualitative (considering this recipe and the previous one). It is now possible to concatenate this data before training a model:
# Concatenate the data back together
X_train = np.concatenate([X_train_quanti, X_train_quali], axis=1)
X_test = np.concatenate([X_test_quanti, X_test_quali], axis=1)
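As a quick sanity check (optional, not part of the original recipe), the concatenated arrays should keep the original number of rows and have the same number of columns in the train and test sets:
# Optional sanity check on the final shapes
print(X_train.shape, X_test.shape)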
There’s more…
It is possible to save the prepared data as a pickle file, either to share it or simply to avoid having to prepare it again. The following code will allow us to do this:
import pickle
pickle.dump((X_train, X_test, y_train, y_test), open('prepared_titanic.pkl', 'wb'))
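Loading the data back later is just as short:
# Load the prepared data back from the pickle file
import pickle
X_train, X_test, y_train, y_test = pickle.load(open('prepared_titanic.pkl', 'rb'))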
We now have fully prepared data that can be used to train ML models.
Note
Several steps have been omitted or simplified here for clarity. Data may need more preparation, such as more thorough missing value imputation, outlier and duplicate detection (and perhaps removal), feature engineering, and so on. It is assumed that you already have some familiarity with those aspects; you are encouraged to read other material on this topic if required.
See also
The scikit-learn documentation about missing value imputation is worth looking at: https://scikit-learn.org/stable/modules/impute.html.
Finally, the more general documentation about data preprocessing can be very useful: https://scikit-learn.org/stable/modules/preprocessing.html.