
Create machine learning pipelines using unsupervised AutoML [Tutorial]

  • 11 min read
  • 07 Aug 2018


AutoML uses unsupervised algorithms to automate algorithm selection, hyperparameter tuning, iterative modeling, and model assessment. When your dataset doesn't have a target variable, you can use clustering algorithms to explore it based on its different characteristics. These algorithms group examples together so that examples within a group are as similar as possible to each other and as dissimilar as possible to examples in other groups.

Since you usually don't have labels when performing this kind of analysis, you need a performance metric that can assess the quality of the separation found by the algorithm without them.

It is called the Silhouette Coefficient. The Silhouette Coefficient helps you to understand two things:


  • Cohesion: Similarity within clusters
  • Separation: Dissimilarity among clusters

It gives you a value between -1 and 1, with values close to 1 indicating well-formed, dense clusters.
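For a quick sense of how to compute it, here is a minimal sketch using scikit-learn's silhouette_score; the two-blob dataset below is made up purely for illustration:

# A minimal sketch of computing the Silhouette Coefficient with scikit-learn
# (the two-blob dataset is made up purely for illustration)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.5, random_state=0)
demo_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X_demo)

# Values close to 1 indicate dense, well-separated clusters
print("Silhouette Coefficient: %0.3f" % silhouette_score(X_demo, demo_labels))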


Clustering algorithms are used to tackle many different tasks, such as finding similar users, songs, or images; detecting key trends and changes in patterns; and understanding community structures in social networks.

This tutorial deals with using unsupervised machine learning algorithms for creating machine learning pipelines.


The code files for this article are available on GitHub.

This article is an excerpt from the book Hands-On Automated Machine Learning, written by Sibanjan Das and Umit Mert Cakmak.

Commonly used clustering algorithms

There are two types of commonly used clustering algorithms: distance-based and probabilistic models. For example, k-means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) are distance-based algorithms, whereas the Gaussian mixture model is probabilistic.

Distance-based algorithms can use a variety of distance measures, with Euclidean distance being the most common choice.

Probabilistic algorithms assume a generative process in which the data comes from a mixture of probability distributions with unknown parameters; the goal is to estimate these parameters from the data.

Since there are many clustering algorithms, picking the right one depends on the characteristics of your data. For example, k-means works with cluster centroids, which requires the clusters in your data to be evenly sized and convex in shape. This means that k-means will not work well on elongated clusters or irregularly shaped manifolds. When the clusters in your data are not evenly sized or convex, you may want to use DBSCAN, which can cluster areas of any shape.
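To make the distinction between the two families concrete, here is a minimal sketch that fits one model from each on the same synthetic data; the dataset and parameter values are chosen only for illustration:

# A minimal sketch: distance-based vs. probabilistic clustering on the same synthetic data
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

X_demo, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.7, random_state=42)

# Distance-based: k-means assigns each point to the nearest centroid,
# while DBSCAN groups points that are densely packed together
kmeans_labels = KMeans(n_clusters=3, random_state=42).fit_predict(X_demo)
dbscan_labels = DBSCAN(eps=0.5).fit_predict(X_demo)

# Probabilistic: estimate the parameters of a mixture of Gaussians from the data
gmm = GaussianMixture(n_components=3, random_state=42).fit(X_demo)
gmm_labels = gmm.predict(X_demo)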

Knowing a thing or two about your data will bring you closer to finding the right algorithms, but what if you don't know much about your data? Many times when you are performing exploratory analysis, it might be hard to get your head around what's happening. If you find yourself in this kind of situation, an automated unsupervised ML pipeline can help you to understand the characteristics of your data better.

Be careful when you perform this kind of analysis, though; the actions you will take later will be driven by the results you will see and this could quickly send you down the wrong path if you are not cautious.


Creating sample datasets with sklearn

In sklearn, there are some useful ways to create sample datasets for testing algorithms:


# Importing necessary libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set context helps you to adjust things like label size, lines and various elements
# Try "notebook", "talk" or "paper" instead of "poster" to see how it changes
sns.set_context('poster')

# set_color_codes will affect how colors such as 'r', 'b', 'g' will be interpreted
sns.set_color_codes()


# Plot keyword arguments will allow you to set things like size or line width to be used in charts.

plot_kwargs = {'s': 10, 'linewidths': 0.1}

import numpy as np
import pandas as pd

# pprint will print your variables in a more readable format in the console
from pprint import pprint

# Creating sample dataset using sklearn samples_generator
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Make blobs will generate isotropic Gaussian blobs
# You can play with arguments like center of blobs, cluster standard deviation
centers = [[2, 1], [-1.5, -1], [1, -1], [-2, 2]]
cluster_std = [0.1, 0.1, 0.1, 0.1]

# Sample data will help you to see your algorithms behavior
X, y = make_blobs(n_samples=1000,
                  centers=centers,
                  cluster_std=cluster_std,
                  random_state=53)

# Plot generated sample data
plt.scatter(X[:, 0], X[:, 1], **plot_kwargs)
plt.show()


We get the following plot from the preceding code:

[Figure: scatter plot of the generated sample data with four tight clusters]


cluster_std will affect the amount of dispersion. Change it to [0.4, 0.5, 0.6, 0.5] and try again:

cluster_std = [0.4, 0.5, 0.6, 0.5]
X, y = make_blobs(n_samples=1000,
                  centers=centers,
                  cluster_std=cluster_std,
                  random_state=53)

plt.scatter(X[:, 0], X[:, 1], **plot_kwargs)
plt.show()


We get the following plot from the preceding code:

[Figure: scatter plot of the sample data with more dispersed clusters]


Now it looks more realistic!

Let's write a small class with helpful methods to create unsupervised experiments. First, you will use the fit_predict method to apply one or more clustering algorithms on the sample dataset:


class Unsupervised_AutoML:
    def __init__(self, estimators=None, transformers=None):
        self.estimators = estimators
        self.transformers = transformers
        pass


The Unsupervised_AutoML class initializes with a set of estimators and transformers. The second class method is fit_predict:

def fit_predict(self, X, y=None):
    """
    fit_predict will train the given estimator(s) and predict cluster membership for each sample
    """

    # This list will hold predictions for each estimator
    predictions = []
    performance_metrics = {}

    for estimator in self.estimators:
        labels = estimator['estimator'](*estimator['args'], **estimator['kwargs']).fit_predict(X)
        estimator['estimator'].n_clusters_ = len(np.unique(labels))
        metrics = self._get_cluster_metrics(estimator['estimator'].__name__, estimator['estimator'].n_clusters_, X, labels, y)
        predictions.append({estimator['estimator'].__name__: labels})
        performance_metrics[estimator['estimator'].__name__] = metrics

    self.predictions = predictions
    self.performance_metrics = performance_metrics

    return predictions, performance_metrics


The fit_predict method uses the _get_cluster_metrics method to get the performance metrics, which is defined in the following code block:

# Printing cluster metrics for given arguments
def _get_cluster_metrics(self, name, n_clusters_, X, pred_labels, true_labels=None):
    from sklearn.metrics import (homogeneity_score,
                                 completeness_score,
                                 v_measure_score,
                                 adjusted_rand_score,
                                 adjusted_mutual_info_score,
                                 silhouette_score)

    print("""################## %s metrics #####################""" % name)
    if len(np.unique(pred_labels)) >= 2:

        silh_co = silhouette_score(X, pred_labels)

        if true_labels is not None:

            h_score = homogeneity_score(true_labels, pred_labels)
            c_score = completeness_score(true_labels, pred_labels)
            vm_score = v_measure_score(true_labels, pred_labels)
            adj_r_score = adjusted_rand_score(true_labels, pred_labels)
            adj_mut_info_score = adjusted_mutual_info_score(true_labels, pred_labels)

            metrics = {"Silhouette Coefficient": silh_co,
                       "Estimated number of clusters": n_clusters_,
                       "Homogeneity": h_score,
                       "Completeness": c_score,
                       "V-measure": vm_score,
                       "Adjusted Rand Index": adj_r_score,
                       "Adjusted Mutual Information": adj_mut_info_score}

            for k, v in metrics.items():
                print("\t%s: %0.3f" % (k, v))

            return metrics

        metrics = {"Silhouette Coefficient": silh_co,
                   "Estimated number of clusters": n_clusters_}

        for k, v in metrics.items():
            print("\t%s: %0.3f" % (k, v))

        return metrics

    else:
        print("\t# of predicted labels is {}, cannot produce metrics.\n".format(np.unique(pred_labels)))


The _get_cluster_metrics method calculates metrics, such as homogeneity_score, completeness_score, v_measure_score, adjusted_rand_score, adjusted_mutual_info_score, and silhouette_score. These metrics will help you to assess how well the clusters are separated and also measure the similarity within and between clusters.

K-means algorithm in action


You can now apply the KMeans algorithm to see how it works:

from sklearn.cluster import KMeans
estimators = [{'estimator': KMeans, 'args':(), 'kwargs':{'n_clusters': 4}}]

unsupervised_learner = Unsupervised_AutoML(estimators)


You can see the estimators:

unsupervised_learner.estimators


This will output the following:

[{'args': (),
 'estimator': sklearn.cluster.k_means_.KMeans,
 'kwargs': {'n_clusters': 4}}]


You can now invoke fit_predict to obtain predictions and performance_metrics:

predictions, performance_metrics = unsupervised_learner.fit_predict(X, y)


Metrics will be written to the console:

################## KMeans metrics #####################
  Silhouette Coefficient: 0.631
  Estimated number of clusters: 4.000
  Homogeneity: 0.951
  Completeness: 0.951
  V-measure: 0.951
  Adjusted Rand Index: 0.966
  Adjusted Mutual Information: 0.950


You can always print metrics later:

pprint(performance_metrics)


This will output the name of the estimator and its metrics:

{'KMeans': {'Silhouette Coefficient': 0.9280431207593165, 'Estimated number of clusters': 4, 'Homogeneity': 1.0, 'Completeness': 1.0, 'V-measure': 1.0, 'Adjusted Rand Index': 1.0, 'Adjusted Mutual Information': 1.0}}


Let's add another class method to plot the clusters of the given estimator and predicted labels:

# plot_clusters will visualize the clusters given predicted labels
def plot_clusters(self, estimator, X, labels, plot_kwargs):
    palette = sns.color_palette('deep', np.unique(labels).max() + 1)
    colors = [palette[x] if x >= 0 else (0.0, 0.0, 0.0) for x in labels]

    plt.scatter(X[:, 0], X[:, 1], c=colors, **plot_kwargs)
    plt.title('{} Clusters'.format(str(estimator.__name__)), fontsize=14)
    plt.show()


Let's see the usage:

plot_kwargs = {'s': 12, 'linewidths': 0.1}
unsupervised_learner.plot_clusters(KMeans,
                                   X,
                                   unsupervised_learner.predictions[0]['KMeans'],
                                   plot_kwargs)


You get the following plot from the preceding block:

[Figure: KMeans clusters plotted on the four-blob sample data]


In this example, the clusters are evenly sized and clearly separated from each other, but when you are doing this kind of exploratory analysis, you should try different hyperparameters and examine the results.
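One quick way to do that, as a minimal sketch reusing the sample data X from above (the range of n_clusters values is arbitrary), is to sweep n_clusters and compare the Silhouette Coefficient for each value:

# A minimal sketch: comparing different n_clusters values by Silhouette Coefficient
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    candidate_labels = KMeans(n_clusters=k, random_state=53).fit_predict(X)
    print("n_clusters=%d, silhouette=%0.3f" % (k, silhouette_score(X, candidate_labels)))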

You will write a wrapper function later in this article to apply a list of clustering algorithms and their hyperparameters to examine the results. For now, let's see one more example with k-means where it does not work well.

When clusters in your dataset have different statistical properties, such as differences in variance, k-means will fail to identify clusters correctly:


X, y = make_blobs(n_samples=2000, centers=5, cluster_std=[1.7, 0.6, 0.8, 1.0, 1.2], random_state=220)
# Plot sample data
plt.scatter(X[:, 0], X[:, 1], **plot_kwargs)
plt.show()


We get the following plot from the preceding code:

[Figure: scatter plot of the five-center sample data with varying cluster dispersion]


Although this sample dataset is generated with five centers, that's not obvious from the plot, and there might well be only four clusters:

from sklearn.cluster import KMeans
estimators = [{'estimator': KMeans, 'args':(), 'kwargs':{'n_clusters': 4}}]

unsupervised_learner = Unsupervised_AutoML(estimators)

predictions, performance_metrics = unsupervised_learner.fit_predict(X, y)


Metrics in the console are as follows:

################## KMeans metrics #####################
  Silhouette Coefficient: 0.549
  Estimated number of clusters: 4.000
  Homogeneity: 0.729
  Completeness: 0.873
  V-measure: 0.795
  Adjusted Rand Index: 0.702
  Adjusted Mutual Information: 0.729


KMeans clusters are plotted as follows:

plot_kwargs = {'s': 12, 'linewidths': 0.1}
unsupervised_learner.plot_clusters(KMeans,
                                   X,
                                   unsupervised_learner.predictions[0]['KMeans'],
                                   plot_kwargs)


We get the following plot from the preceding code:

[Figure: KMeans clusters (n_clusters=4) plotted on the five-center dataset]


In this example, the points between the red (dark gray) and bottom-green (light gray) clusters seem to form one big cluster. K-means calculates each centroid as the mean of the points assigned to it, which doesn't cope well with this layout. Here, you need a different approach.

The DBSCAN algorithm in action


DBSCAN is one of the clustering algorithms that can deal with non-flat geometry and uneven cluster sizes. Let's see what it can do:

from sklearn.cluster import DBSCAN


estimators = [{'estimator': DBSCAN, 'args':(), 'kwargs':{'eps': 0.5}}]

unsupervised_learner = Unsupervised_AutoML(estimators)

predictions, performance_metrics = unsupervised_learner.fit_predict(X, y)

Metrics in the console are as follows:

################## DBSCAN metrics #####################
  Silhouette Coefficient: 0.231
  Estimated number of clusters: 12.000
  Homogeneity: 0.794
  Completeness: 0.800
  V-measure: 0.797
  Adjusted Rand Index: 0.737
  Adjusted Mutual Information: 0.792


DBSCAN clusters are plotted as follows:

plot_kwargs = {'s': 12, 'linewidths': 0.1}
unsupervised_learner.plot_clusters(DBSCAN,
                                   X,
                                   unsupervised_learner.predictions[0]['DBSCAN'],
                                   plot_kwargs)


We get the following plot from the preceding code:

[Figure: DBSCAN clusters plotted on the five-center dataset, with unassigned points in black]


The conflict between the red (dark gray) and bottom-green (light gray) clusters from the k-means case seems to be gone, but what's interesting here is that some small clusters appeared and some points were not assigned to any cluster at all, based on their distance.

DBSCAN has the eps (epsilon) hyperparameter, which controls how close points need to be to fall in the same neighborhood; you can play with that parameter to see how the algorithm behaves.
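As a minimal sketch (the eps values below are arbitrary, and X is the sample data from above), you can sweep eps and watch how the number of clusters and noise points change; DBSCAN marks points it cannot assign to any cluster with the label -1:

# A minimal sketch: trying a few arbitrary eps values to see how DBSCAN reacts
import numpy as np
from sklearn.cluster import DBSCAN

for eps in [0.3, 0.5, 0.8, 1.2]:
    labels = DBSCAN(eps=eps).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # exclude the noise label
    n_noise = int(np.sum(labels == -1))                          # points not assigned to any cluster
    print("eps=%.1f -> clusters: %d, noise points: %d" % (eps, n_clusters, n_noise))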

When you are doing this kind of exploratory analysis on data you don't know much about, visual clues are always important, because metrics can mislead you; not every clustering algorithm can be assessed using the same metrics.


To summarize, we covered many different aspects of choosing a suitable ML pipeline for a given problem, and you gained a better understanding of how unsupervised algorithms may suit your needs.

To gain a clearer understanding of the different aspects of automated machine learning, and how to incorporate automation tasks using practical datasets, check out the book Hands-On Automated Machine Learning.
