In this recipe, we will be building more complex pipelines using mixed-type columnar data. We'll use a speed dating dataset that was published in 2006 by Fisman et al.: https://doi.org/10.1162/qjec.2006.121.2.673
Perhaps this recipe will be informative in more ways than one, and we'll learn something useful about the mechanics of human mating choices.
The dataset description on the OpenML website reads as follows:
This data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four-minute first date with every other participant of the opposite sex. At the end of their 4 minutes, participants were asked whether they would like to see their date again. They were also asked to rate their date on six attributes: attractiveness, sincerity, intelligence, fun, ambition, and shared interests. The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include demographics, dating habits, self-perception across key attributes, beliefs in terms of what others find valuable in a mate, and lifestyle information.
The problem is to predict mate choices from what we know about participants and their matches. This dataset presents some challenges that can serve an illustrative purpose:
- It contains 123 different features of different types:
  - Categorical
  - Numerical
  - Range features
It also contains the following:
- Some missing values
- Target imbalance
On the way to solving this problem of predicting mate choices, we will build custom encoders in scikit-learn and a pipeline comprising all features and their preprocessing steps.
The primary focus in this recipe will be on pipelines and transformers. In particular, we will build a custom transformer for working with range features and another one for numerical features.
Getting ready
We'll need the following libraries for this recipe:
- OpenML to download the dataset
- openml_speed_dating_pipeline_steps to use our custom transformers
- imbalanced-learn to work with imbalanced classes
- category_encoders to encode categorical variables
- shap to show us the importance of features
In order to install them, we can use pip again:
pip install -q openml openml_speed_dating_pipeline_steps==0.5.5 imbalanced_learn category_encoders shap
OpenML is an organization that intends to make data science and machine learning reproducible and therefore more conducive to research. The OpenML website not only hosts datasets, but also allows the uploading of machine learning results to public leaderboards under the condition that the implementation relies solely on open source. These results and how they were obtained can be inspected in complete detail by anyone who's interested.
In order to retrieve the data, we will use the OpenML Python API. The get_dataset() method will download the dataset; with get_data(), we can get pandas DataFrames for features and target, and we'll conveniently get the information on categorical and numerical feature types:
import openml

dataset = openml.datasets.get_dataset(40536)
X, y, categorical_indicator, _ = dataset.get_data(
    dataset_format='DataFrame',
    target=dataset.default_target_attribute
)
categorical_features = list(X.columns[categorical_indicator])
numeric_features = list(
    X.columns[[not i for i in categorical_indicator]]
)
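Before building the pipeline, we can quickly confirm the two issues mentioned above, the missing values and the target imbalance. This is just a quick check, and the exact numbers depend on the dataset version:
# how imbalanced are the two classes of the target?
print(y.value_counts(normalize=True))
# which columns have the most missing values?
print(X.isna().sum().sort_values(ascending=False).head())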
In the original version of the dataset, as presented in the paper, there was a lot more work to do. However, the version of the dataset on OpenML already has missing values represented as
numpy.nan, which lets us skip this conversion. You can see this preprocessor on GitHub if you are interested:
https://github.com/benman1/OpenML-Speed-Dating
Alternatively, you can use a download link from the OpenML dataset web page at https://www.openml.org/data/get_csv/13153954/speeddating.arff.
With the dataset loaded and the libraries installed, we are ready to get cracking.
How to do it...
Pipelines are a way of describing a sequence of transformations applied on top of the raw dataset, chaining preprocessing steps with a final predictor. We will see examples of these concepts in this recipe and throughout this book.
A few things stand out pretty quickly when looking at this dataset. We have a lot of categorical features. So, for modeling, we will need to encode them numerically, as in the Modeling and predicting in Keras recipe in Chapter 1, Getting Started with Artificial Intelligence in Python.
Encoding ranges numerically
Some of these categorical features are actually encoded ranges. This means they are ordinal; in other words, they are categories with an order. For example, the d_interests_correlate feature contains strings like these:
['[0-0.33]', '[0.33-1]', '[-1-0]']
If we were to treat these ranges as categorical variables, we'd lose the information about their order, as well as the information about how far apart two values are. However, if we convert them to numbers, we keep this information and can apply other numerical transformations on top.
We are going to implement a transformer to plug into an sklearn pipeline in order to convert these range features to numerical features. The basic idea of the conversion is to extract the upper and lower bounds of these ranges as follows:
def encode_ranges(range_str):
    splits = range_str[1:-1].split('-')
    range_max = splits[-1]
    range_min = '-'.join(splits[:-1])
    return range_min, range_max

examples = X['d_interests_correlate'].unique()
[encode_ranges(r) for r in examples]
We'll see this for our example:
[('0', '0.33'), ('0.33', '1'), ('-1', '0')]
In order to get numerical features, we can then take the mean of the two bounds. As we've mentioned before, OpenML not only shows results, it also makes the models themselves transparent. Therefore, if we want to submit our model, we can only use published modules. We created a module and published it in the PyPI Python package repository, where you can find the package with the complete code: https://pypi.org/project/openml-speed-dating-pipeline-steps/.
Here is the simplified code for RangeTransformer:
from functools import lru_cache

import pandas as pd
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, TransformerMixin
import category_encoders.utils as util


class RangeTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, range_features=None, suffix='_range/mean', n_jobs=-1):
        assert isinstance(range_features, list) or range_features is None
        self.range_features = range_features
        self.suffix = suffix
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = util.convert_input(X)
        if self.range_features is None:
            self.range_features = list(X.columns)

        range_data = pd.DataFrame(index=X.index)
        for col in self.range_features:
            range_data[str(col) + self.suffix] = pd.to_numeric(
                self._vectorize(X[col])
            )
        self.feature_names = list(range_data.columns)
        return range_data

    def _vectorize(self, s):
        # encode the range strings of a column in parallel
        return Parallel(n_jobs=self.n_jobs)(
            delayed(self._encode_range)(x) for x in s
        )

    @staticmethod
    @lru_cache(maxsize=32)
    def _encode_range(range_str):
        # convert a range string such as '[0-0.33]' to the mean of its bounds
        splits = range_str[1:-1].split('-')
        range_max = float(splits[-1])
        range_min = float('-'.join(splits[:-1]))
        return sum([range_min, range_max]) / 2.0

    def get_feature_names(self):
        return self.feature_names
This is a shortened snippet of the custom transformer for ranges. Please see the full implementation on GitHub at https://github.com/benman1/OpenML-Speed-Dating.
Please pay attention to how the fit() and transform() methods are used. We don't need to do anything in the fit() method, because we always apply the same static rule. The transform() method applies this rule; we've seen examples of it previously. What we do in the transform() method is iterate over the columns. This transformer also shows the use of the parallelization pattern typical of scikit-learn. Additionally, since these ranges repeat a lot, and there aren't that many distinct ones, we use a cache so that, instead of doing costly string transformations, a range value can be retrieved from memory once it has been processed.
An important thing about custom transformers in scikit-learn is that they should inherit from BaseEstimator and TransformerMixin, and implement the fit() and transform() methods. Later on, we will require get_feature_names() so we can find out the names of the features generated.
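As a quick sanity check, we can apply the transformer to the range column we looked at earlier. This is just a minimal sketch; we drop missing values here because, in the full pipeline below, imputation happens before this step:
range_transformer = RangeTransformer()
range_transformer.fit_transform(
    X[['d_interests_correlate']].dropna()
).head()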
Deriving higher-order features
Let's implement another transformer. As you may have noticed, we have different types of features that seem to refer to the same personal attributes:
- Personal preferences
- Self-assessment
- Assessment of the other person
It seems clear that differences between any of these features could be significant, such as the importance someone attaches to sincerity versus how sincere they rate a potential partner. Therefore, our next transformer is going to calculate the differences between numerical features. This is supposed to help highlight these differences.
These features are derived from other features, and combine information from two (or potentially more) features. Let's see what NumericDifferenceTransformer looks like:
import operator


class NumericDifferenceTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, features=None,
                 suffix='_numdist', op=operator.sub, n_jobs=-1
                 ):
        assert isinstance(features, list) or features is None
        self.features = features
        self.suffix = suffix
        self.op = op
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        X = util.convert_input(X)
        if self.features is None:
            self.features = list(
                X.select_dtypes(include='number').columns
            )
        return self

    def _col_name(self, col1, col2):
        return str(col1) + '_' + str(col2) + self.suffix

    def _feature_pairs(self):
        feature_pairs = []
        for i, col1 in enumerate(self.features[:-1]):
            for col2 in self.features[i+1:]:
                feature_pairs.append((col1, col2))
        return feature_pairs

    def transform(self, X, y=None):
        X = util.convert_input(X)
        feature_pairs = self._feature_pairs()
        columns = Parallel(n_jobs=self.n_jobs)(
            delayed(self._col_name)(col1, col2)
            for col1, col2 in feature_pairs
        )
        data_cols = Parallel(n_jobs=self.n_jobs)(
            delayed(self.op)(X[col1], X[col2])
            for col1, col2 in feature_pairs
        )
        data = pd.concat(data_cols, axis=1)
        data.rename(
            columns={i: col for i, col in enumerate(columns)},
            inplace=True, copy=False
        )
        data.index = X.index
        # keep track of the generated column names for get_feature_names()
        self.feature_names = list(data.columns)
        return data

    def get_feature_names(self):
        return self.feature_names
This is a custom transformer that calculates differences between numerical features. Please refer to the full implementation in the repository of the OpenML-Speed-Dating library at https://github.com/benman1/OpenML-Speed-Dating.
This transformer has a very similar structure to RangeTransformer. Please note the parallelization between columns. One of the arguments to the __init__() method is the function that is used to calculate the difference. This is operator.sub() by default. The operator library is part of the Python standard library and implements basic operators as functions. The sub() function does what it sounds like:
import operator
operator.sub(1, 2) == 1 - 2
# True
This gives us a prefix or functional syntax for standard operators. Since we can pass functions as arguments, this gives us the flexibility to specify different operators between columns.
The fit() method this time just collects the names of numerical columns, and we'll use these names in the transform() method.
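As a quick illustration, here is a sketch that computes the pairwise differences for the first few numeric columns we identified earlier; any binary function can be passed as op, for example operator.truediv for ratios instead of differences:
# pairwise differences between a handful of numeric columns
num_diff = NumericDifferenceTransformer(
    features=numeric_features[:4]
)
num_diff.fit_transform(X).head()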
Combining transformations
We will put these transformers together with ColumnTransformer and the pipeline. However, we'll need to make the association between columns and their transformations. We'll define different groups of columns:
range_cols = [
    col for col in X.select_dtypes(include='category')
    if X[col].apply(
        lambda x: x.startswith('[') if isinstance(x, str) else False
    ).any()
]
cat_columns = list(
    set(X.select_dtypes(include='category').columns) - set(range_cols)
)
num_columns = list(
    X.select_dtypes(include='number').columns
)
Now we have columns that are ranges, columns that are categorical, and columns that are numerical, and we can assign pipeline steps to them.
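A quick check of this grouping can be helpful; this is just a sketch, and the exact counts depend on the dataset version:
print(len(range_cols), 'range columns')
print(len(cat_columns), 'categorical columns')
print(len(num_columns), 'numeric columns')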
In our case, we put this together as follows, first in a preprocessor:
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
import category_encoders as ce
import openml_speed_dating_pipeline_steps as pipeline_steps
preprocessor = ColumnTransformer(
    transformers=[
        ('ranges', Pipeline(steps=[
            ('impute', pipeline_steps.SimpleImputerWithFeatureNames(
                strategy='constant', fill_value=-1
            )),
            ('encode', pipeline_steps.RangeTransformer())
        ]), range_cols),
        ('cat', Pipeline(steps=[
            ('impute', pipeline_steps.SimpleImputerWithFeatureNames(
                strategy='constant', fill_value='-1'
            )),
            ('encode', ce.OneHotEncoder(
                cols=None,  # encode all features given to it by ColumnTransformer
                handle_unknown='ignore',
                use_cat_names=True
            ))
        ]), cat_columns),
        ('num', pipeline_steps.SimpleImputerWithFeatureNames(
            strategy='median'
        ), num_columns),
    ],
    remainder='drop', n_jobs=-1
)
And then we'll put the preprocessing in a pipeline, together with the estimator:
def create_model(n_estimators=100):
    return Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('numeric_differences', pipeline_steps.NumericDifferenceTransformer()),
            ('feature_selection', SelectKBest(f_classif, k=20)),
            ('rf', BalancedRandomForestClassifier(
                n_estimators=n_estimators,
            ))
        ]
    )
Here is the performance on the test set:
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.33,
    random_state=42,
    stratify=y
)
clf = create_model(50)
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
auc = roc_auc_score(y_test, y_predicted)
print('auc: {:.3f}'.format(auc))
We get the following performance as an output:
auc: 0.779
This is a very good performance, as you can see by comparing it to the leaderboard on OpenML.
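Since we have already imported confusion_matrix, we can also take a quick look at the types of errors the classifier makes; this is just a sketch, and we don't reproduce the output here:
print(confusion_matrix(y_test, y_predicted))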
How it works...
It is time to explain the basic scikit-learn terminology relevant to this recipe. None of these concepts corresponds to an existing machine learning algorithm; rather, they are composable modules (a minimal sketch follows this list):
- Transformer (in scikit-learn): A class that is derived from sklearn.base.TransformerMixin; it has fit() and transform() methods. These involve preprocessing steps or feature selection.
- Predictor: A class that is derived from either sklearn.base.ClassifierMixin or sklearn.base.RegressorMixin; it has fit() and predict() methods. These are machine learning estimators, in other words, classifiers or regressors.
- Pipeline: An interface that wraps all steps together and gives you a single interface for all steps of the transformation and the resulting estimator. A pipeline again has fit() and predict() methods.
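To make this contract concrete, here is a minimal, self-contained sketch using synthetic data and standard scikit-learn components rather than our speed dating pipeline:
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(n_samples=100, random_state=0)
pipe = Pipeline(steps=[
    ('scale', StandardScaler()),    # a transformer: fit() and transform()
    ('clf', LogisticRegression())   # a predictor: fit() and predict()
])
pipe.fit(X_toy, y_toy)              # the pipeline itself exposes fit() and predict()
pipe.predict(X_toy[:5])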
There are a few things to point out regarding our approach. As we said before, we have missing values, so we have to impute (meaning replace) missing values with other values. In this case, we replace missing values with -1. In the case of categorical variables, this will be a new category, while in the case of numerical variables, it will become a special value that the classifier will have to handle.
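To illustrate what the imputation step does, here is a sketch using scikit-learn's plain SimpleImputer; our pipeline uses the library's SimpleImputerWithFeatureNames, which, as its name suggests, additionally keeps track of feature names:
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='constant', fill_value=-1)
imputer.fit_transform(np.array([[1.0, np.nan], [np.nan, 3.0]]))
# array([[ 1., -1.],
#        [-1.,  3.]])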
ColumnTransformer came with version 0.20 of scikit-learn and was a long-awaited feature. Since then, it has become the standard way of applying different preprocessing to different subsets of columns.
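In its simplest form, usage often looks roughly like the following sketch, where X means our features; feature_preprocessing here is just an illustrative ColumnTransformer that imputes the numeric columns and drops everything else:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

feature_preprocessing = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), num_columns)
    ],
    remainder='drop'
)
X_transformed = feature_preprocessing.fit_transform(X)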
Alternatively, we can put ColumnTransformer as a step into a pipeline, for example, like this:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
make_pipeline(
    feature_preprocessing,
    LogisticRegression()
)
Our classifier is a modified form of the random forest classifier. A random forest is a collection of decision trees, each trained on random subsets of the training data. The balanced random forest classifier (Chen et al.: https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf) makes sure that each random subset is balanced between the two classes.
Since NumericDifferenceTransformer can provide lots of features, we incorporate an extra step of univariate feature selection.
There's more...
You can see the complete example with the speed dating dataset, a few more custom transformers, and an extended imputation class in the GitHub repository of the openml_speed_dating_pipeline_steps library, and in the book's notebook at https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-Cookbook/blob/master/chapter02/Transforming%20Data%20in%20Scikit-Learn.ipynb.
Both RangeTransformer and NumericDifferenceTransformer could also have been implemented using FunctionTransformer in scikit-learn.
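For instance, a stripped-down version of the range encoding could be written with FunctionTransformer roughly as follows. This is just a sketch that reuses the encode_ranges() helper from earlier and ignores missing values and feature names:
from sklearn.preprocessing import FunctionTransformer

def range_midpoints(df):
    # convert each range string such as '[0-0.33]' to the midpoint of its bounds
    def midpoint(range_str):
        range_min, range_max = encode_ranges(range_str)
        return (float(range_min) + float(range_max)) / 2.0
    return df.astype(str).apply(lambda col: col.map(midpoint))

range_encoder = FunctionTransformer(range_midpoints)
range_encoder.fit_transform(X[['d_interests_correlate']].dropna())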
ColumnTransformer is especially handy for pandas DataFrames or NumPy arrays since it allows the specification of different operations for different subsets of the features. However, another option is FeatureUnion, which allows concatenation of the results from different transformations. For yet another way to chain our operations together, please have a look at PandasPicker in our repository.
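As a small illustration of FeatureUnion, the following sketch concatenates two views of the numeric columns side by side: the median-imputed values and binary flags marking where values were missing. It uses standard scikit-learn components rather than our custom transformers:
from sklearn.pipeline import FeatureUnion
from sklearn.impute import SimpleImputer, MissingIndicator

numeric_union = FeatureUnion(transformer_list=[
    ('imputed', SimpleImputer(strategy='median')),
    ('missing_flags', MissingIndicator(features='all')),
])
numeric_union.fit_transform(X[num_columns])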
See also
In this recipe, we used ANOVA f-values for univariate feature selection, which is relatively simple, yet effective. Univariate feature selection methods are usually simple filters or statistical tests that measure the relevance of a feature with regard to the target. There are, however, many different methods for feature selection, and scikit-learn implements a lot of them: https://scikit-learn.org/stable/modules/feature_selection.html.