In this recipe, we will be building more complex pipelines using mixed-type columnar data. We'll use a speed dating dataset that was published in 2006 by Fisman et al.: https://doi.org/10.1162/qjec.2006.121.2.673

Perhaps this recipe will be informative in more ways than one, and we'll learn something useful about the mechanics of human mating choices.

The dataset description on the OpenML website reads as follows:

This data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four-minute first date with every other participant of the opposite sex. At the end of their 4 minutes, participants were asked whether they would like to see their date again. They were also asked to rate their date on six attributes: attractiveness, sincerity, intelligence, fun, ambition, and shared interests. The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include demographics, dating habits, self-perception across key attributes, beliefs in terms of what others find valuable in a mate, and lifestyle information.

The problem is to predict mate choices from what we know about participants and their matches. This dataset presents some challenges that can serve an illustrative purpose:

It contains 123 different features, of different types:
- Categorical
- Numerical
- Range features

It also contains the following:

- Some missing values
- Target imbalance

On the way to solving this problem of predicting mate choices, we will build custom encoders in scikit-learn and a pipeline comprising all features and their preprocessing steps.

The primary focus in this recipe will be on pipelines and transformers. In particular, we will build a custom transformer for working with range features and another one for numerical features.

Getting ready

We'll need the following libraries for this recipe:

- openml to download the dataset
- openml_speed_dating_pipeline_steps to use our custom transformers
- imbalanced-learn to work with imbalanced classes
- category_encoders to encode categorical features
- shap to show us the importance of features

In order to install them, we can use pip again:

pip install -q openml openml_speed_dating_pipeline_steps==0.5.5 imbalanced_learn category_encoders shap

OpenML is an organization that intends to make data science and machine learning reproducible and therefore more conducive to research. The OpenML website not only hosts datasets, but also allows the uploading of machine learning results to public leaderboards under the condition that the implementation relies solely on open source. These results and how they were obtained can be inspected in complete detail by anyone who's interested.

In order to retrieve the data, we will use the OpenML Python API. The get_dataset() method will download the dataset; with get_data(), we can get pandas DataFrames for features and target, and we'll conveniently get the information on categorical and numerical feature types:

import openml
dataset = openml.datasets.get_dataset(40536)
X, y, categorical_indicator, _ = dataset.get_data(
  dataset_format='dataframe',
  target=dataset.default_target_attribute
)
categorical_features = list(X.columns[categorical_indicator])
numeric_features = list(
  X.columns[[not(i) for i in categorical_indicator]]
)

In the original version of the dataset, as presented in the paper, there was a lot more work to do. However, the version of the dataset on OpenML already has missing values represented as numpy.nan, which lets us skip this conversion. You can see this preprocessor on GitHub if you are interested: https://github.com/benman1/OpenML-Speed-Dating

Alternatively, you can use a download link from the OpenML dataset web page at https://www.openml.org/data/get_csv/13153954/speeddating.arff.

With the dataset loaded, and the libraries installed, we are ready to start cracking.

How to do it...

Pipelines are a way of describing how machine learning algorithms, including preprocessing steps, can follow one another in a sequence of transformations on top of the raw dataset before applying a final predictor. We will see examples of these concepts in this recipe and throughout this book.

A few things stand out pretty quickly looking at this dataset. We have a lot of categorical features. So, for modeling, we will need to encode them numerically, as in the Modeling and predicting in Keras recipe in Chapter 1, Getting Started with Artificial Intelligence in Python.

Encoding ranges numerically

Some of these are actually encoded ranges. This means these are ordinal, in other words, they are categories that are ordered; for example, the d_interests_correlate feature contains strings like these:

[[0-0.33], [0.33-1], [-1-0]]

If we were to treat these ranges as categorical variables, we'd lose the information about their order and about how far apart two values are. If we convert them to numbers instead, we keep this information and can apply other numerical transformations on top.

We are going to implement a transformer to plug into an sklearn pipeline in order to convert these range features to numerical features. The basic idea of the conversion is to extract the upper and lower bounds of these ranges as follows:

def encode_ranges(range_str):
  splits = range_str[1:-1].split('-')
  range_max = splits[-1]
  range_min = '-'.join(splits[:-1])
  return range_min, range_max

examples = X['d_interests_correlate'].unique()
[encode_ranges(r) for r in examples]

We'll see this for our example:

[('0', '0.33'), ('0.33', '1'), ('-1', '0')]

In order to get numerical features, we can then take the mean of the two bounds. As we've mentioned before, OpenML shows not only the results but also how the models were built. Therefore, if we want to submit our model, we can only use published modules. We created a module and published it on PyPI, the Python package repository, where you can find the package with the complete code: https://pypi.org/project/openml-speed-dating-pipeline-steps/.
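The step from the two bounds to a single number can be sketched in plain Python. This is a minimal illustration rather than the packaged implementation; the helper name range_midpoint is hypothetical, and the sketch assumes the upper bound is never negative (otherwise the split on '-' would misplace the sign).

```python
def range_midpoint(range_str):
    """Return the midpoint of a range string such as '[0.33-1]' or '[-1-0]'."""
    # Strip the surrounding brackets, then split on every hyphen.
    splits = range_str[1:-1].split('-')
    # The last piece is the upper bound (assumed to be non-negative).
    range_max = float(splits[-1])
    # Re-joining the remaining pieces restores a leading minus sign
    # on the lower bound, e.g. ['', '1'] -> '-1'.
    range_min = float('-'.join(splits[:-1]))
    return (range_min + range_max) / 2.0


midpoints = [range_midpoint(r) for r in ['[0-0.33]', '[0.33-1]', '[-1-0]']]
print(midpoints)
```

The three example ranges map to approximately 0.165, 0.665, and -0.5, so their order and spacing are preserved.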

Here is the simplified code for RangeTransformer:

from functools import lru_cache

import pandas as pd
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, TransformerMixin
import category_encoders.utils as util


class RangeTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, range_features=None, suffix='_range/mean', n_jobs=-1):
        assert isinstance(range_features, list) or range_features is None
        self.range_features = range_features
        self.suffix = suffix
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        # Nothing to learn: the encoding is a fixed rule
        return self

    def transform(self, X, y=None):
        X = util.convert_input(X)
        if self.range_features is None:
            self.range_features = list(X.columns)

        range_data = pd.DataFrame(index=X.index)
        for col in self.range_features:
            range_data[str(col) + self.suffix] = pd.to_numeric(
                self._vectorize(X[col])
            )
        self.feature_names = list(range_data.columns)
        return range_data

    def _vectorize(self, s):
        # Encode the values of one column in parallel
        return Parallel(n_jobs=self.n_jobs)(
            delayed(self._encode_range)(x) for x in s
        )

    @staticmethod
    @lru_cache(maxsize=32)
    def _encode_range(range_str):
        # '[-1-0]' -> -0.5; a leading '-' belongs to the lower bound
        splits = range_str[1:-1].split('-')
        range_max = float(splits[-1])
        range_min = float('-'.join(splits[:-1]))
        return sum([range_min, range_max]) / 2.0

    def get_feature_names(self):
        return self.feature_names

This is a shortened snippet of the custom transformer for ranges. Please see the full implementation on GitHub at https://github.com/benman1/OpenML-Speed-Dating.

Please pay attention to how the fit() and transform() methods are used. We don't need to do anything in the fit() method, because we always apply the same static rule. The transform() method applies this rule, which we've seen in the examples previously, by iterating over the columns. This transformer also shows the use of the parallelization pattern typical of scikit-learn. Additionally, since there are only a few distinct ranges and they repeat a lot, we use a cache so that, instead of repeating costly string transformations, the value can be retrieved from memory once a range has been processed.
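The effect of the cache can be made visible with a small standalone sketch. It uses functools.lru_cache on a module-level function (the name encode_range_cached is hypothetical) and a made-up column with only three distinct range strings.

```python
from functools import lru_cache


@lru_cache(maxsize=32)
def encode_range_cached(range_str):
    """Midpoint of a range string; results are memoized per distinct string."""
    splits = range_str[1:-1].split('-')
    range_max = float(splits[-1])
    range_min = float('-'.join(splits[:-1]))
    return (range_min + range_max) / 2.0


# A column with many rows but only three distinct range strings.
column = ['[0-0.33]', '[0.33-1]', '[-1-0]'] * 1000
encoded = [encode_range_cached(value) for value in column]

# Only the first occurrence of each string is parsed; the rest are cache hits.
info = encode_range_cached.cache_info()
print(info.hits, info.misses)
```

Here, 3 of the 3,000 calls do the string parsing and the remaining 2,997 are served from memory. Note that an lru_cache lives inside one Python process, so with process-based parallel workers each worker would presumably build up its own cache.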

An important thing about custom transformers in scikit-learn is that they should inherit from BaseEstimator and TransformerMixin, and implement the fit() and transform() methods. Later on, we will require get_feature_names() so we can find out the names of the features generated.

Deriving higher-order features

Let's implement another transformer. As you may have noticed, we have different types of features that seem to refer to the same personal attributes:

- Personal preferences
- Self-assessment
- Assessment of the other person

It seems clear that differences between any of these features could be significant, such as the importance someone places on sincerity versus how sincere they rate a potential partner. Therefore, our next transformer calculates the differences between numerical features, which should help highlight such discrepancies.

These features are derived from other features and combine information from two (or potentially more) features. Let's see what the NumericDifferenceTransformer class looks like:

import operator

import pandas as pd
from joblib import Parallel, delayed

class NumericDifferenceTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, features=None,
                 suffix='_numdist', op=operator.sub, n_jobs=-1
                 ):
        assert isinstance(
            features, list
        ) or features is None
        self.features = features
        self.suffix = suffix
        self.op = op
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        X = util.convert_input(X)
        if self.features is None:
            self.features = list(
                X.select_dtypes(include='number').columns
            )
        return self

    def _col_name(self, col1, col2):
        return str(col1) + '_' + str(col2) + self.suffix

    def _feature_pairs(self):
        feature_pairs = []
        for i, col1 in enumerate(self.features[:-1]):
            for col2 in self.features[i+1:]:
                feature_pairs.append((col1, col2))
        return feature_pairs

    def transform(self, X, y=None):
        X = util.convert_input(X)

        feature_pairs = self._feature_pairs()
        columns = Parallel(n_jobs=self.n_jobs)(
            delayed(self._col_name)(col1, col2)
            for col1, col2 in feature_pairs
        )
        data_cols = Parallel(n_jobs=self.n_jobs)(
            delayed(self.op)(X[col1], X[col2])
            for col1, col2 in feature_pairs
        )
        data = pd.concat(data_cols, axis=1)
        data.columns = columns
        data.index = X.index
        # Remember the generated names for get_feature_names()
        self.feature_names = columns
        return data

    def get_feature_names(self):
        return self.feature_names

This is a custom transformer that calculates differences between numerical features. Please refer to the full implementation in the repository of the OpenML-Speed-Dating library at https://github.com/benman1/OpenML-Speed-Dating.

This transformer has a very similar structure to RangeTransformer. Please note the parallelization between columns. One of the arguments to the __init__() method is the function that is used to calculate the difference. This is operator.sub() by default. The operator library is part of the Python standard library and implements basic operators as functions. The sub() function does what it sounds like:

import operator
operator.sub(1, 2) == 1 - 2
# True

This gives us a prefix or functional syntax for standard operators. Since we can pass functions as arguments, this gives us the flexibility to specify different operators between columns.
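To illustrate how swapping the operator changes the derived features, here is a simplified, non-parallel sketch. The pairwise_features helper and the toy column names are hypothetical and are not part of the published package.

```python
import operator
from itertools import combinations

import pandas as pd


def pairwise_features(df, op=operator.sub, suffix='_numdist'):
    """Apply a binary operator to every unordered pair of columns."""
    derived = {
        f'{col1}_{col2}{suffix}': op(df[col1], df[col2])
        for col1, col2 in combinations(df.columns, 2)
    }
    return pd.DataFrame(derived, index=df.index)


toy = pd.DataFrame({
    'attractive': [6.0, 8.0],
    'sincere': [9.0, 7.0],
    'funny': [3.0, 4.0],
})
differences = pairwise_features(toy)                         # col1 - col2
ratios = pairwise_features(toy, operator.truediv, '_ratio')  # col1 / col2
print(list(differences.columns))
```

With three input columns we get three derived columns per operator. The number of pairs grows quadratically with the number of inputs, which is why a feature selection step is useful afterward.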

The fit() method this time just collects the names of numerical columns, and we'll use these names in the transform() method.

Combining transformations

We will put these transformers together with ColumnTransformer and the pipeline. However, we'll need to make the association between columns and their transformations. We'll define different groups of columns:

range_cols = [
    col for col in X.select_dtypes(include='category')
    if X[col].apply(
        lambda x: x.startswith('[') if isinstance(x, str) else False
    ).any()
]
cat_columns = list(
  set(X.select_dtypes(include='category').columns) - set(range_cols)
)
num_columns = list(
    X.select_dtypes(include='number').columns
)

Now we have columns that are ranges, columns that are categorical, and columns that are numerical, and we can assign pipeline steps to them.
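The grouping logic can be checked on a small made-up DataFrame without downloading anything. The column names and values below are purely illustrative, and looks_like_range is a hypothetical helper.

```python
import pandas as pd

toy = pd.DataFrame({
    'd_age': pd.Categorical(['[0-1]', '[2-3]', None]),
    'field': pd.Categorical(['law', 'physics', 'law']),
    'age': [21.0, 24.0, 25.0],
})


def looks_like_range(series):
    """True if any string value in the series starts with '['."""
    return any(isinstance(x, str) and x.startswith('[') for x in series)


categorical = toy.select_dtypes(include='category').columns
toy_range_cols = [col for col in categorical if looks_like_range(toy[col])]
toy_cat_columns = [col for col in categorical if col not in toy_range_cols]
toy_num_columns = list(toy.select_dtypes(include='number').columns)
print(toy_range_cols, toy_cat_columns, toy_num_columns)
```

Using a list comprehension instead of a set difference for the categorical group keeps the original column order, which can make the resulting feature names easier to trace.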

In our case, we put this together as follows, first in a preprocessor:

from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
import category_encoders as ce
import openml_speed_dating_pipeline_steps as pipeline_steps

preprocessor = ColumnTransformer(
    transformers=[
        ('ranges', Pipeline(steps=[
            ('impute', pipeline_steps.SimpleImputerWithFeatureNames(
                strategy='constant', fill_value=-1)),
            ('encode', pipeline_steps.RangeTransformer())
        ]), range_cols),
        ('cat', Pipeline(steps=[
            ('impute', pipeline_steps.SimpleImputerWithFeatureNames(
                strategy='constant', fill_value='-1')),
            ('encode', ce.OneHotEncoder(
                cols=None,  # all features passed in by ColumnTransformer
                handle_unknown='ignore',
                use_cat_names=True
            ))
        ]), cat_columns),
        ('num', pipeline_steps.SimpleImputerWithFeatureNames(
            strategy='median'), num_columns),
    ],
    remainder='drop', n_jobs=-1
)

And then we'll put the preprocessing in a pipeline, together with the estimator:

def create_model(n_estimators=100):
    return Pipeline(
        steps=[('preprocessor', preprocessor),
               ('numeric_differences', pipeline_steps.NumericDifferenceTransformer()),
               ('feature_selection', SelectKBest(f_classif, k=20)),
               ('rf', BalancedRandomForestClassifier(
                  n_estimators=n_estimators,
                  )
               )]
       )

Here is the performance in the test set:

from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
  X, y,
  test_size=0.33,
  random_state=42,
  stratify=y
)
clf = create_model(50)
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
auc = roc_auc_score(y_test, y_predicted)
print('auc: {:.3f}'.format(auc))

We get the following performance as an output:

auc: 0.779

This is a very good performance, as you can see by comparing it to the leaderboard on OpenML.
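The evaluation above passes hard class predictions to roc_auc_score. For a binary problem, that number coincides with balanced accuracy, whereas the AUC is more commonly computed from predicted probabilities. The following sketch on synthetic data (not the speed dating data; the model and all variable names are illustrative) computes both variants side by side.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem standing in for the real data.
X_toy, y_toy = make_classification(
    n_samples=500, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: a single operating point on the ROC curve.
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))
# AUC from probabilities of the positive class: uses the full ranking.
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
balanced_acc = balanced_accuracy_score(y_te, model.predict(X_te))
print(auc_from_labels, auc_from_scores, balanced_acc)
```

When comparing against a leaderboard, it is worth checking which of the two variants the other entries report.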

How it works...

It is time to explain the basic scikit-learn terminology relevant to this recipe. None of these concepts corresponds to a machine learning algorithm as such; they are composable modules:

- Transformer (in scikit-learn): A class that is derived from sklearn.base.TransformerMixin; it has fit() and transform() methods. These involve preprocessing steps or feature selection.
- Predictor: A class that is derived from either sklearn.base.ClassifierMixin or sklearn.base.RegressorMixin; it has fit() and predict() methods. These are machine learning estimators, in other words, classifiers or regressors.
- Pipeline: An interface that wraps all steps together and gives you a single interface for all steps of the transformation and the resulting estimator. A pipeline again has fit() and predict() methods.
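The three roles can be seen together in a minimal sketch. Centerer is a hypothetical toy transformer written for this illustration, and the data is random.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class Centerer(BaseEstimator, TransformerMixin):
    """Toy transformer: subtract the column means learned in fit()."""

    def fit(self, X, y=None):
        self.means_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.means_


rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

pipe = Pipeline(steps=[
    ('center', Centerer()),         # transformer: fit() and transform()
    ('clf', LogisticRegression()),  # predictor: fit() and predict()
])
pipe.fit(X_toy, y_toy)              # fits each step in order
predictions = pipe.predict(X_toy)   # transforms, then predicts
print(predictions[:5])
```

Calling fit() on the pipeline fits and applies each transformer in turn before fitting the final predictor, so the same preprocessing is replayed automatically at prediction time.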

There are a few things to point out regarding our approach. As we said before, we have missing values, so we have to impute (meaning replace) missing values with other values. In this case, we replace missing values in the range and categorical features with -1. In the case of categorical variables, this will be a new category, while in the case of the range features, which end up as numbers, it will become a special value that the classifier will have to handle. Missing values in the numerical features are replaced with the column median.
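The two imputation strategies can be compared with scikit-learn's own SimpleImputer. This is a sketch on a made-up column; the recipe itself uses the SimpleImputerWithFeatureNames wrapper from the companion package.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column with a missing value and an outlier.
numeric = np.array([[1.0], [np.nan], [3.0], [100.0]])

constant_imputer = SimpleImputer(strategy='constant', fill_value=-1)
median_imputer = SimpleImputer(strategy='median')

filled_constant = constant_imputer.fit_transform(numeric).ravel()
filled_median = median_imputer.fit_transform(numeric).ravel()
print(filled_constant)
print(filled_median)
```

The constant strategy marks the gap with a sentinel value (-1), while the median strategy fills it with 3.0, the median of the observed values, which is not distorted by the outlier.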

ColumnTransformer came with version 0.20 of scikit-learn and was a long-awaited feature. It is typically used like this:

from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

feature_preprocessing = make_column_transformer(
  (StandardScaler(), ['column1', 'column2']),
  (OneHotEncoder(), ['column3', 'column4', 'column5']) 
)

feature_preprocessing can then be used as usual with the fit(), transform(), and fit_transform() methods:

processed_features = feature_preprocessing.fit_transform(X)

Here, X means our features.

Alternatively, we can put ColumnTransformer as a step into a pipeline, for example, like this:

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

make_pipeline(
    feature_preprocessing,
    LogisticRegression()
)

Our classifier is a modified form of the random forest classifier. A random forest is a collection of decision trees, each trained on random subsets of the training data. The balanced random forest classifier (Chen et al.: https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf) makes sure that each random subset is balanced between the two classes.
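The balancing idea can be sketched with NumPy alone: for each tree, draw a bootstrap sample that contains as many majority-class rows as minority-class rows. This is an illustration of the principle only; the function name is hypothetical, and the sampling details inside imbalanced-learn may differ.

```python
import numpy as np


def balanced_bootstrap_indices(y, rng):
    """Draw a bootstrap sample with equally many rows from each class."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()  # size of the minority class
    sampled = [
        # Sample row positions of class c with replacement.
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
        for c in classes
    ]
    return np.concatenate(sampled)


rng = np.random.default_rng(0)
y_toy = np.array([0] * 90 + [1] * 10)  # 9:1 class imbalance
idx = balanced_bootstrap_indices(y_toy, rng)
class_counts = np.bincount(y_toy[idx])
print(class_counts)
```

Each tree would then be trained on the rows selected by idx, so it sees both classes equally often even though the full dataset is imbalanced.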

Since NumericDifferenceTransformer can generate a large number of features, we incorporate an extra step of univariate feature selection (SelectKBest with ANOVA F-values).

There's more...

You can see the complete example with the speed dating dataset, a few more custom transformers, and an extended imputation class in the GitHub repository of the openml_speed_dating_pipeline_steps library (https://github.com/benman1/OpenML-Speed-Dating) and in the notebook at https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-Cookbook/blob/master/chapter02/Transforming%20Data%20in%20Scikit-Learn.ipynb.

Both RangeTransformer and NumericDifferenceTransformer could also have been implemented using FunctionTransformer in scikit-learn.
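As a rough sketch of that alternative, a stateless range encoder can be wrapped in FunctionTransformer. The helper names are hypothetical, and this version handles neither missing values nor feature names the way the packaged transformer does.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def range_midpoint(range_str):
    """Midpoint of a range string such as '[0.33-1]'."""
    splits = range_str[1:-1].split('-')
    return (float('-'.join(splits[:-1])) + float(splits[-1])) / 2.0


def encode_range_frame(X):
    """Element-wise midpoint encoding for a DataFrame of range strings."""
    return X.apply(lambda column: column.map(range_midpoint))


range_encoder = FunctionTransformer(encode_range_frame)

toy = pd.DataFrame(
    {'d_interests_correlate': ['[0-0.33]', '[0.33-1]', '[-1-0]']}
)
encoded_frame = range_encoder.fit_transform(toy)
print(encoded_frame['d_interests_correlate'].tolist())
```

This works well for fixed rules that need no fitting. A dedicated class becomes worthwhile once the transformer needs parameters, state learned in fit(), or a method that reports the generated feature names.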

ColumnTransformer is especially handy for pandas DataFrames or NumPy arrays since it allows the specification of different operations for different subsets of the features. However, another option is FeatureUnion, which allows concatenation of the results from different transformations. For yet another way to chain our operations together, please have a look at PandasPicker in our repository.
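The difference from ColumnTransformer is that FeatureUnion applies every transformer to all input columns and concatenates the outputs side by side. A minimal sketch on random data, with arbitrarily chosen transformers:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

X_toy = np.random.RandomState(0).normal(size=(50, 4))

union = FeatureUnion([
    ('scaled', StandardScaler()),  # 4 output columns
    ('pca', PCA(n_components=2)),  # 2 output columns
])
combined = union.fit_transform(X_toy)
print(combined.shape)
```

The result has 4 + 2 = 6 columns: the scaled original features followed by the two principal components.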

In this recipe, we used ANOVA F-values for univariate feature selection, which is relatively simple, yet effective. Univariate feature selection methods are usually simple filters or statistical tests that measure the relevance of a feature with regard to the target. There are, however, many different methods for feature selection, and scikit-learn implements a lot of them: https://scikit-learn.org/stable/modules/feature_selection.html.
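The selection step can be tried in isolation on synthetic data. The dataset parameters below are arbitrary and only serve the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 30 features, of which only 5 carry signal about the target.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0
)

selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X_toy, y_toy)

# Indices of the columns that were kept, ranked by their F-value.
kept = selector.get_support(indices=True)
print(X_selected.shape, kept)
```

Because each feature is scored independently against the target, this filter is fast, but it cannot detect features that are only useful in combination with others.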
