What is AutoML? When we talk about AutoML, we mostly refer to automated data preparation (namely feature preprocessing, generation, and selection) and model training (model selection and hyperparameter optimization). The number of possible options for each step of this process can vary widely depending on the problem type.
AutoML allows researchers and practitioners to automatically build ML pipelines out of the possible options for every step to find high-performing ML models for a given problem.
AutoML libraries carefully set up experiments for various ML pipelines, covering all the steps from data ingestion and data processing to modeling and scoring.
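To make this concrete, the following is a minimal sketch of the kind of pipeline and hyperparameter search that AutoML libraries automate and scale up; the dataset and parameter ranges here are chosen purely for illustration:
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# A hand-crafted pipeline: preprocessing -> feature selection -> model
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest()),
    ("model", LogisticRegression(max_iter=1000)),
])

# A tiny, manually chosen search space; AutoML explores far larger spaces automatically
param_grid = {
    "select__k": [5, 10, 20],
    "model__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
AutoML tools extend this idea by also choosing the preprocessing steps and model families themselves, not just their hyperparameters.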
In this article, we explain what AutoML is and cover popular AutoML libraries with practical examples.
This article is an excerpt from a book written by Sibanjan Das, Umit Mert Cakmak titled Hands-On Automated Machine Learning.
There are many popular AutoML libraries, and in this section you will get an overview of commonly used ones in the data science community.
Featuretools is a good library for automatically engineering features from relational and transactional data. The library introduces a concept called Deep Feature Synthesis (DFS). If you have multiple datasets with relationships defined among them, such as parent-child relationships based on columns that serve as unique identifiers, DFS will create new features by applying aggregation calculations such as sum, count, mean, mode, standard deviation, and so on. Let's go through a small example with two tables, one holding the basic database information and the other holding the transactions for each database:
import pandas as pd
# First dataset contains the basic information for databases.
databases_df = pd.DataFrame({"database_id": [2234, 1765, 8796, 2237, 3398],
"creation_date": ["2018-02-01", "2017-03-02", "2017-05-03", "2013-05-12", "2012-05-09"]})
databases_df.head()
You get the following output:
The following is the code for the database transactions:
# Second dataset contains the transactions for each database id
db_transactions_df = pd.DataFrame({"transaction_id": [26482746, 19384752, 48571125, 78546789, 19998765, 26482646, 12484752, 42471125, 75346789, 16498765, 65487547, 23453847, 56756771, 45645667, 23423498, 12335268, 76435357, 34534711, 45656746, 12312987],
                                   "database_id": [2234, 1765, 2234, 2237, 1765, 8796, 2237, 8796, 3398, 2237, 3398, 2237, 2234, 8796, 1765, 2234, 2237, 1765, 8796, 2237],
                                   "transaction_size": [10, 20, 30, 50, 100, 40, 60, 60, 10, 20, 60, 50, 40, 40, 30, 90, 130, 40, 50, 30],
                                   "transaction_date": ["2018-02-02", "2018-03-02", "2018-03-02", "2018-04-02", "2018-04-02", "2018-05-02", "2018-06-02", "2018-06-02", "2018-07-02", "2018-07-02", "2018-01-03", "2018-02-03", "2018-03-03", "2018-04-03", "2018-04-03", "2018-07-03", "2018-07-03", "2018-07-03", "2018-08-03", "2018-08-03"]})
db_transactions_df.head()
You get the following output:
The code for the entities is as follows:
# Entities for each of the datasets should be defined
entities = {
    "databases": (databases_df, "database_id"),
    "transactions": (db_transactions_df, "transaction_id")
}
# Relationships between tables should also be defined as below
relationships = [("databases", "database_id", "transactions", "database_id")]
print(entities)
You get the following output for the preceding code:
The following code snippet will create feature matrix and feature definitions:
# There are 2 entities called 'databases' and 'transactions'
# All the pieces necessary to engineer features are in place; you can create your feature matrix as below
import featuretools as ft
feature_matrix_db_transactions, feature_defs = ft.dfs(entities=entities,
                                                      relationships=relationships,
                                                      target_entity="databases")
The following output shows some of the features that are generated:
You can see all feature definitions by looking at the following features_defs:
feature_defs
The output is as follows:
This is how you can easily generate features based on relational and transactional datasets.
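The resulting feature_matrix_db_transactions is a regular pandas DataFrame indexed by database_id, so you can feed it into any downstream model. The labels in the sketch below are invented purely for illustration and build on the variables created above:
from sklearn.ensemble import RandomForestClassifier

# Hypothetical labels, one per database_id, invented only for illustration
labels = pd.Series([0, 1, 0, 1, 0], index=databases_df["database_id"])

# Keep the numeric columns of the generated feature matrix
X = feature_matrix_db_transactions.select_dtypes(include="number").fillna(0)
y = labels.loc[X.index]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))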
Auto-sklearn is built on top of scikit-learn, which has a great API for developing ML models and pipelines. Scikit-learn's API is very consistent and mature; if you are used to working with it, auto-sklearn will be just as easy to use, since it's really a drop-in replacement for scikit-learn estimators.
Let's see a little example:
# Necessary imports
import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
from sklearn.model_selection import train_test_split
# The digits dataset is one of the most popular datasets in the machine learning community.
# Every example in this dataset represents an 8x8 image of a digit.
X, y = sklearn.datasets.load_digits(return_X_y=True)
# Let's see the first image. The image is reshaped to 8x8; otherwise it's a vector of size 64.
X[0].reshape(8,8)
The output is as follows:
You can plot a couple of images to see how they look:
import matplotlib.pyplot as plt
number_of_images = 10
images_and_labels = list(zip(X, y))
for i, (image, label) in enumerate(images_and_labels[:number_of_images]):
    plt.subplot(2, number_of_images, i + 1)
    plt.axis('off')
    plt.imshow(image.reshape(8, 8), cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('%i' % label)
plt.show()
Running the preceding snippet will give you the following plot:
Splitting the dataset into train and test data:
# We split our dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Similarly to creating an estimator in Scikit-learn, we create AutoSklearnClassifier
automl = autosklearn.classification.AutoSklearnClassifier()
# All you need to do is invoke the fit method to start experimenting with different feature engineering methods and machine learning models
automl.fit(X_train, y_train)
# Generating predictions is the same as in scikit-learn; you invoke the predict method.
y_hat = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))
# Accuracy score 0.98
That was easy, wasn't it?
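In practice you will usually want to bound the search. Auto-sklearn lets you pass time budgets when constructing the classifier; the values below are arbitrary and only illustrate the idea, reusing the train split from above:
# Limit the whole search to 5 minutes and each candidate pipeline to 30 seconds
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,   # total budget in seconds
    per_run_time_limit=30,         # budget per candidate pipeline in seconds
)
automl.fit(X_train, y_train)

# Summary statistics of the search and the models in the final ensemble
print(automl.sprint_statistics())
print(automl.show_models())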
MLBox is another AutoML library that supports distributed data processing, cleaning, formatting, and state-of-the-art algorithms such as LightGBM and XGBoost. It also supports model stacking, which allows you to combine an ensemble of models to create a new model that aims to perform better than the individual models.
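MLBox handles stacking internally; to illustrate the idea itself, here is a minimal sketch using plain scikit-learn (not MLBox's API), where the predictions of several base models are combined by a meta-model:
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Base models whose predictions are combined by a meta-model
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000))

print(cross_val_score(stack, X, y, cv=3).mean())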
Here's an example of using MLBox itself:
# Necessary Imports
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *
import wget
file_link = 'https://apsportal.ibm.com/exchange-api/v1/entries/8044492073eb964f46597b4be06ff5ea/data?accessKey=9561295fa407698694b1e254d0099600'
file_name = wget.download(file_link)
print(file_name)
# GoSales_Tx_NaiveBayes.csv
The GoSales dataset contains information about customers and their product preferences:
import pandas as pd
df = pd.read_csv('GoSales_Tx_NaiveBayes.csv')
df.head()
You get the following output from the preceding code:
Let's create a test set from the same dataset by dropping the target column:
test_df = df.drop(['PRODUCT_LINE'], axis = 1)
# First 300 records saved as test dataset
test_df[:300].to_csv('test_data.csv')
paths = ["GoSales_Tx_NaiveBayes.csv", "test_data.csv"]
target_name = "PRODUCT_LINE"
rd = Reader(sep = ',')
df = rd.train_test_split(paths, target_name)
The output will be similar to the following:
Drift_thresholder will help you drop IDs and variables that drift between the train and test datasets:
dft = Drift_thresholder()
df = dft.fit_transform(df)
You get the following output:
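Conceptually, this kind of drift detection is often done as adversarial validation: a classifier is trained to tell train rows from test rows, and features that make this separation easy (AUC well above 0.5) are flagged as drifting. The following is a conceptual sketch only, not MLBox's actual implementation, and the column name in the usage comment is just an example:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def drift_score(train_col, test_col):
    """Estimate how easily a single feature separates train from test rows.
    A score close to 0.5 means no drift; close to 1.0 means heavy drift."""
    values = pd.concat([train_col, test_col]).to_frame()
    origin = np.r_[np.zeros(len(train_col)), np.ones(len(test_col))]
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, values, origin, cv=3, scoring="roc_auc").mean()

# e.g. drift_score(train_df["AGE"], test_df["AGE"]) for a column of your choice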
Optimiser will optimize the hyperparameters:
opt = Optimiser(scoring='accuracy', n_folds=3)
opt.evaluate(None, df)
You get the following output by running the preceding code:
The following code defines the search space for the ML pipeline. The prefixes refer to the pipeline steps: ne for the missing-value encoder, ce for the categorical encoder, fs for the feature selector, and est for the estimator:
space = {
    'ne__numerical_strategy': {"search": "choice", "space": [0]},
    'ce__strategy': {"search": "choice",
                     "space": ["label_encoding", "random_projection", "entity_embedding"]},
    'fs__threshold': {"search": "uniform", "space": [0.01, 0.3]},
    'est__max_depth': {"search": "choice", "space": [3, 4, 5, 6, 7]}
}
best = opt.optimise(space, df, 15)
The following output shows the methods selected for each pipeline step being tested, together with the ML algorithm, which is LightGBM in this case:
You can also see various measures such as accuracy, variance, and CPU time:
With Predictor, you can use the best model to make predictions:
predictor = Predictor()
predictor.fit_predict(best, df)
You get the following output:
Tree-Based Pipeline Optimization Tool (TPOT) uses genetic programming to find the best performing ML pipelines, and it is built on top of scikit-learn.
Once your dataset is cleaned and ready to be used, TPOT will help you with the following steps of your ML pipeline: feature preprocessing, feature construction and selection, model selection, and hyperparameter optimization.
Once TPOT is done with its experimentation, it will provide you with the best performing pipeline.
TPOT is very user-friendly as it's similar to using scikit-learn's API:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
# The digits dataset that you used in the auto-sklearn example
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
train_size=0.75, test_size=0.25)
# You will create your TPOT classifier with commonly used arguments
tpot = TPOTClassifier(generations=10, population_size=30, verbosity=2)
# When you invoke the fit method, TPOT will create generations of populations, seeking the best set of parameters. The arguments you used to create the TPOTClassifier, such as generations and population_size, affect the search space and the resulting pipeline.
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
# 0.9834
tpot.export('my_pipeline.py')
Once you have exported your pipeline to the my_pipeline.py file, you will see the selected pipeline components:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'].values, random_state=42)
exported_pipeline = KNeighborsClassifier(n_neighbors=6, weights="distance")
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
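The exported file is a template: the read_csv call with its placeholder path expects your own data with a column named target. As a sketch, you could adapt it to the digits data from this example as shown below, keeping the KNeighborsClassifier settings that TPOT found:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the digits data directly instead of the CSV placeholder
digits = load_digits()
training_features, testing_features, training_target, testing_target = \
    train_test_split(digits.data, digits.target, random_state=42)

exported_pipeline = KNeighborsClassifier(n_neighbors=6, weights="distance")
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
print((results == testing_target).mean())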
To summarize, you learnt about Automated ML and practiced your skills using popular AutoML libraries. This is definitely not the whole list, and AutoML is an active area of research. You should check out other libraries such as Auto-WEKA, which also uses the latest innovations in Bayesian optimization, and Xcessiv, which is a user-friendly tool for creating stacked ensembles.
To know how AutoML can be further used to automate parts of Machine Learning, check out the book Hands-On Automated Machine Learning.