Overview of AutoML libraries

There are many popular AutoML libraries, and in this section you will get an overview of the ones most commonly used in the data science community.

Featuretools

Featuretools (https://www.featuretools.com/) is a good library for automatically engineering features from relational and transactional data. The library introduces a concept called Deep Feature Synthesis (DFS). If you have multiple datasets with relationships defined among them, such as parent-child relationships based on the columns you use as unique identifiers, DFS will create new features based on aggregation calculations such as sum, count, mean, mode, standard deviation, and so on. Let's go through a small example with two tables, one holding the basic database information and the other holding the transactions for each database:

import pandas as pd

# The first dataset contains the basic information for the databases
databases_df = pd.DataFrame({"database_id": [2234, 1765, 8796, 2237, 3398],
                             "creation_date": ["2018-02-01", "2017-03-02", "2017-05-03", "2013-05-12", "2012-05-09"]})

databases_df.head()

You get the following output:

The following is the code for the database transactions:

# The second dataset contains the transaction information for each database id
db_transactions_df = pd.DataFrame({"transaction_id": [26482746, 19384752, 48571125, 78546789, 19998765, 26482646, 12484752, 42471125, 75346789, 16498765, 65487547, 23453847, 56756771, 45645667, 23423498, 12335268, 76435357, 34534711, 45656746, 12312987],
                                   "database_id": [2234, 1765, 2234, 2237, 1765, 8796, 2237, 8796, 3398, 2237, 3398, 2237, 2234, 8796, 1765, 2234, 2237, 1765, 8796, 2237],
                                   "transaction_size": [10, 20, 30, 50, 100, 40, 60, 60, 10, 20, 60, 50, 40, 40, 30, 90, 130, 40, 50, 30],
                                   "transaction_date": ["2018-02-02", "2018-03-02", "2018-03-02", "2018-04-02", "2018-04-02", "2018-05-02", "2018-06-02", "2018-06-02", "2018-07-02", "2018-07-02", "2018-01-03", "2018-02-03", "2018-03-03", "2018-04-03", "2018-04-03", "2018-07-03", "2018-07-03", "2018-07-03", "2018-08-03", "2018-08-03"]})

db_transactions_df.head()

You get the following output:

The code for the entities is as follows:

# Entities for each of the datasets should be defined
entities = {
    "databases": (databases_df, "database_id"),
    "transactions": (db_transactions_df, "transaction_id")
}

# Relationships between tables should also be defined, as below
relationships = [("databases", "database_id", "transactions", "database_id")]

print(entities)

You get the following output for the preceding code:

The following code snippet will create the feature matrix and the feature definitions:

# There are 2 entities, called 'databases' and 'transactions'

# All the pieces necessary to engineer features are in place; you can create your feature matrix as below

import featuretools as ft

feature_matrix_db_transactions, feature_defs = ft.dfs(entities=entities,
                                                      relationships=relationships,
                                                      target_entity="databases")

The following output shows some of the features that are generated:

Generated features using databases and transaction entities

You can see all the feature definitions by looking at feature_defs:

feature_defs

The output is as follows:

This is how you can easily generate features based on relational and transactional datasets.
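
By default, DFS applies a broad set of aggregation and transform primitives to the relationships you defined. If you only want specific calculations, you can restrict them. The following is a minimal sketch, assuming the featuretools version used here accepts the agg_primitives and trans_primitives arguments of ft.dfs, and reusing the entities and relationships defined above:

import featuretools as ft

# A minimal sketch: restrict DFS to a handful of aggregation primitives and
# disable transform primitives. Reuses the entities and relationships
# dictionaries defined earlier in this section.
feature_matrix_restricted, feature_defs_restricted = ft.dfs(entities=entities,
                                                            relationships=relationships,
                                                            target_entity="databases",
                                                            agg_primitives=["sum", "mean", "count"],
                                                            trans_primitives=[])

feature_matrix_restricted.head()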

Auto-sklearn

Scikit-learn has a great API for developing ML models and pipelines, and it is very consistent and mature. If you are used to working with it, auto-sklearn (http://automl.github.io/auto-sklearn/stable/) will be just as easy to use, since it's essentially a drop-in replacement for scikit-learn estimators.

Let's see a little example:

# Necessary imports
import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
from sklearn.model_selection import train_test_split

# The digits dataset is one of the most popular datasets in the machine learning community.
# Every example in this dataset represents an 8x8 image of a digit.
X, y = sklearn.datasets.load_digits(return_X_y=True)

# Let's look at the first image. It is reshaped to 8x8; otherwise it's a vector of size 64.
X[0].reshape(8,8)

The output is as follows:

You can plot a couple of images to see how they look:

import matplotlib.pyplot as plt

number_of_images = 10
images_and_labels = list(zip(X, y))

for i, (image, label) in enumerate(images_and_labels[:number_of_images]):
    plt.subplot(2, number_of_images, i + 1)
    plt.axis('off')
    plt.imshow(image.reshape(8,8), cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('%i' % label)

plt.show()

Running the preceding snippet will give you the following plot:

Split the dataset into train and test data:

# We split our dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Similar to creating an estimator in scikit-learn, we create an AutoSklearnClassifier
automl = autosklearn.classification.AutoSklearnClassifier()

# All you need to do is invoke the fit method to start experimenting with different feature engineering methods and machine learning models
automl.fit(X_train, y_train)

# Generating predictions is the same as in scikit-learn; you invoke the predict method
y_hat = automl.predict(X_test)

print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))
# Accuracy score 0.98

That was easy, wasn't it?
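
One practical note: by default, AutoSklearnClassifier will keep searching for quite a while. You will usually want to cap the search. The following is a minimal sketch, assuming auto-sklearn's time_left_for_this_task and per_run_time_limit arguments and its sprint_statistics method, and reusing the train split from the example above:

import autosklearn.classification

# A minimal sketch: cap the total search time and the time spent on any
# single model, then print a summary of the search afterwards.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # total budget for the search, in seconds
    per_run_time_limit=30         # budget for each individual model fit
)

automl.fit(X_train, y_train)

# Prints how many models were evaluated, how many succeeded, and so on
print(automl.sprint_statistics())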

MLBox

MLBox (http://mlbox.readthedocs.io/en/latest/) is another AutoML library, and it supports distributed data processing, cleaning, formatting, and state-of-the-art algorithms such as LightGBM and XGBoost. It also supports model stacking, which allows you to combine an ensemble of models to generate a new model that aims to perform better than the individual models.
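
Model stacking itself is not specific to MLBox. As an illustration of the idea only (this is not MLBox's API), here is a minimal sketch using scikit-learn's StackingClassifier, where a final estimator learns how to combine the predictions of several base models:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Base models whose out-of-fold predictions become features for the final model
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
    ("svc", SVC(probability=True, random_state=1))
]

# The final estimator learns how to combine the base models' predictions
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000))

stack.fit(X_train, y_train)
print("Stacked model accuracy:", stack.score(X_test, y_test))

MLBox applies this kind of stacking as part of its own pipeline.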

Here's an example of MLBox's usage:

# Necessary Imports
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *
import wget

file_link = 'https://apsportal.ibm.com/exchange-api/v1/entries/8044492073eb964f46597b4be06ff5ea/data?accessKey=9561295fa407698694b1e254d0099600'
file_name = wget.download(file_link)

print(file_name)
# GoSales_Tx_NaiveBayes.csv

The GoSales dataset contains information about customers and their product preferences:

import pandas as pd
df = pd.read_csv('GoSales_Tx_NaiveBayes.csv')
df.head()

You get the following output from the preceding code:

Let's create a test set from the same dataset by dropping the target column:

test_df = df.drop(['PRODUCT_LINE'], axis = 1)

# The first 300 records are saved as the test dataset
test_df[:300].to_csv('test_data.csv')

paths = ["GoSales_Tx_NaiveBayes.csv", "test_data.csv"]
target_name = "PRODUCT_LINE"

rd = Reader(sep = ',')
df = rd.train_test_split(paths, target_name)

The output will be similar to the following:

Drift_thresholder will help you drop IDs and variables that drift between the train and test datasets:

dft = Drift_thresholder()
df = dft.fit_transform(df)

You get the following output:

Optimiser will optimize the hyperparameters:

opt = Optimiser(scoring = 'accuracy', n_folds = 3)
opt.evaluate(None, df)

You get the following output by running the preceding code:

The following code defines the hyperparameter search space for the ML pipeline:

space = {
    'ne__numerical_strategy': {"search": "choice", "space": [0]},
    'ce__strategy': {"search": "choice",
                     "space": ["label_encoding", "random_projection", "entity_embedding"]},
    'fs__threshold': {"search": "uniform", "space": [0.01, 0.3]},
    'est__max_depth': {"search": "choice", "space": [3, 4, 5, 6, 7]}
}

best = opt.optimise(space, df, 15)

The following output shows the methods that are being tested, along with the chosen ML algorithm, which is LightGBM in this output:

You can also see various measures such as accuracy, variance, and CPU time:

With Predictor, you can use the best model to make predictions:

predictor = Predictor()
predictor.fit_predict(best, df)

You get the following output:

TPOT

The Tree-based Pipeline Optimization Tool (TPOT) uses genetic programming to find the best-performing ML pipelines, and it is built on top of scikit-learn.

Once your dataset is cleaned and ready to be used, TPOT will help you with the following steps of your ML pipeline:

  • Feature preprocessing
  • Feature construction and selection
  • Model selection
  • Hyperparameter optimization

Once TPOT is done with its experimentation, it will provide you with the best performing pipeline.

TPOT is very user-friendly, as working with it is similar to using scikit-learn's API:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# The digits dataset that you used in the auto-sklearn example
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

# You will create your TPOT classifier with commonly used arguments
tpot = TPOTClassifier(generations=10, population_size=30, verbosity=2)

# When you invoke the fit method, TPOT will create generations of populations, seeking the best set of parameters. The arguments you used to create the TPOTClassifier, such as generations and population_size, affect the search space and the resulting pipeline.
tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))
# 0.9834
tpot.export('my_pipeline.py')

Once you have exported your pipeline to the my_pipeline.py file, you will see the selected pipeline components:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'].values, random_state=42)

exported_pipeline = KNeighborsClassifier(n_neighbors=6, weights="distance")

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
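
As a quick usage note, the exported script expects your own CSV file with the target column named 'target'. If you simply want to try the selected component on the digits split from the TPOT example above, a minimal, hypothetical sketch looks like this:

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical reuse of the pipeline component TPOT selected above,
# applied directly to the digits train/test split from the TPOT example
model = KNeighborsClassifier(n_neighbors=6, weights="distance")
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))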

This is it!
