
You're reading from Applied Unsupervised Learning with Python Discover hidden patterns and relationships in unstructured data with Python

Product type Paperback

Published in May 2019

Publisher

ISBN-13 9781789952292

Length 482 pages

Edition 1st Edition

Languages

Python

Tools

Scikit-learn

Concepts

Machine Learning

Authors (3):

Benjamin Johnston

Christopher Kruger

Aaron Jones


Table of Contents (12) Chapters

Applied Unsupervised Learning with Python

Preface

1. Introduction to Clustering

2. Hierarchical Clustering

3. Neighborhood Approaches and DBSCAN

4. Dimension Reduction and PCA

5. Autoencoders

6. t-Distributed Stochastic Neighbor Embedding (t-SNE)

7. Topic Modeling

8. Market Basket Analysis

9. Hotspot Analysis

Appendix

Chapter 1: Introduction to Clustering

Activity 1: Implementing k-means Clustering

Solution:

Load the Iris data file using pandas, a package that makes data wrangling much easier through the use of DataFrames:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import cdist

iris = pd.read_csv('iris_data.csv', header=None)
iris.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'species']
```
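
If `iris_data.csv` is not at hand, a similar DataFrame can be built from the copy of Iris bundled with scikit-learn. This is a minimal sketch, not the book's loading step: it assumes scikit-learn's `load_iris` as a stand-in source, and its species names (for example `setosa`) may be spelled differently from those in the CSV file.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data instead of iris_data.csv.
bunch = load_iris()

# Reuse the column names from the activity so later steps work unchanged.
iris = pd.DataFrame(
    bunch.data,
    columns=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'],
)

# Map the integer targets (0, 1, 2) to their species names.
iris['species'] = bunch.target_names[bunch.target]

print(iris.shape)  # (150, 5)
```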

Separate out the X features and the provided y species labels, since we want to treat this as an unsupervised learning problem:
```
X = iris[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = iris['species']
```
Get an idea of what our features look like:
```
X.head()
```
The output is as follows:
Figure 1.22: First five rows of the data

Bring back the k_means function we made earlier for reference:

```
def k_means(X, K):
    # Keep track of history so you can see k-means in action
    centroids_history = []
    labels_history = []
    # Pick K distinct rows of X as the initial centroids
    rand_index = np.random.choice(X.shape[0], K, replace=False)
    centroids = X[rand_index]
    centroids_history.append(centroids)
    while True:
        # Euclidean distances are calculated for each point relative to the
        # centroids, and then np.argmin returns the index of the minimal
        # distance, which is the cluster the point is assigned to
        labels = np.argmin(cdist(X, centroids), axis=1)
        labels_history.append(labels)
        # Take the mean of the points within each cluster to find new centroids
        new_centroids = np.array([X[labels == i].mean(axis=0)
                                  for i in range(K)])
        centroids_history.append(new_centroids)

        # If the centroids no longer change, k-means is complete. Otherwise continue
        if np.all(centroids == new_centroids):
            break
        centroids = new_centroids

    return centroids, labels, centroids_history, labels_history
```
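
The function above stops only when the centroids match exactly, and it takes the mean of every cluster even if a cluster ends up empty. The following sketch shows one possible defensive variant. The name `k_means_safe` and its `max_iter`, `tol`, and `seed` parameters are hypothetical additions for illustration, not part of the book's code, and the toy data is made up.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_means_safe(X, K, max_iter=100, tol=1e-8, seed=None):
    """Hypothetical k-means variant with a tolerance, an iteration cap,
    and a guard against empty clusters."""
    rng = np.random.default_rng(seed)
    # Sample K distinct rows as the starting centroids.
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        labels = np.argmin(cdist(X, centroids), axis=1)
        new_centroids = centroids.copy()
        for i in range(K):
            members = X[labels == i]
            # Keep the old centroid if a cluster received no points.
            if len(members) > 0:
                new_centroids[i] = members.mean(axis=0)
        # Stop once the centroids have effectively stopped moving.
        if np.allclose(centroids, new_centroids, atol=tol):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated 2-D blobs (illustrative values only).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                    rng.normal(5.0, 0.5, size=(50, 2))])
demo_centroids, demo_labels = k_means_safe(X_demo, K=2, seed=0)
print(demo_centroids)
```

The iteration cap bounds the running time even if the centroids keep shifting by tiny amounts, and the empty-cluster guard avoids the NaN centroids that a mean over zero points would produce.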

Convert our Iris X feature DataFrame to a NumPy array:
```
X_mat = X.values
```

Run our k_means function on the Iris matrix:

```
centroids, labels, centroids_history, labels_history = k_means(X_mat, 3)
```

See what labels we get by looking at the cluster assigned to each sample. These are arbitrary cluster indices (0, 1, or 2), not species names:
```
print(labels)
```
The output is as follows:
Figure 1.23: List of predicted species
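
Because cluster indices are arbitrary, one way to relate them to the held-out species labels is a cross-tabulation. The sketch below is self-contained and rests on two assumptions: it uses scikit-learn's `KMeans` in place of the `k_means` function above, and scikit-learn's bundled Iris data in place of `iris_data.csv`. The resulting counts are therefore not guaranteed to match the book's run.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data and its KMeans estimator
# stand in for iris_data.csv and the hand-written k_means function.
data = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km.fit_predict(data.data)

# Species names for each sample, kept aside during clustering.
species = data.target_names[data.target]

# Rows are true species, columns are cluster indices, cells are sample counts.
table = pd.crosstab(pd.Series(species, name='species'),
                    pd.Series(cluster_ids, name='cluster'))
print(table)
```

A species whose row is concentrated in a single column was recovered cleanly as one cluster; a row spread over several columns indicates samples that were split between clusters.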
Before looking at the clusters, plot the raw sepal measurements without any cluster coloring as a baseline:
```
plt.scatter(X['SepalLengthCm'], X['SepalWidthCm'])
plt.title('Iris - Sepal Length vs Width')
plt.show()
```
The output is as follows:
Figure 1.24: Plot of performed k-means implementation
Visualize the clusters of Iris species as follows:
```
plt.scatter(X['SepalLengthCm'], X['SepalWidthCm'], c=labels, cmap='tab20b')
plt.title('Iris - Sepal Length vs Width - Clustered')
plt.show()
```
The output is as follows:
Figure 1.25: Clusters of Iris species
Calculate the Silhouette Score using the scikit-learn implementation:
```
# Calculate Silhouette Score

silhouette_score(X[['SepalLengthCm','SepalWidthCm']], labels)
```
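You will get a silhouette score (SSI) roughly equal to 0.369; the exact value can vary between runs because the initial centroids are chosen at random. Note that the clusters were fitted on all four features, while the score here is computed on the two sepal features only, matching the plotted view. Given that restriction, the score is acceptable when combined with the visualization of cluster memberships seen in the final plot.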
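
To see how much the choice of features affects the score, the same cluster labels can be scored against the two sepal features and against all four features. This is a minimal sketch under stated assumptions: it uses scikit-learn's `KMeans` and bundled Iris data rather than the `k_means` function and CSV file from this activity, so the numbers it prints need not equal the 0.369 quoted above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Assumption: scikit-learn's bundled Iris data; columns are ordered
# sepal length, sepal width, petal length, petal width.
X_all = load_iris().data

# Cluster on all four features, as in the activity.
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_all)

# Score the same labels in two different feature spaces.
score_sepal = silhouette_score(X_all[:, :2], cluster_ids)
score_all = silhouette_score(X_all, cluster_ids)

print(f"Sepal features only: {score_sepal:.3f}")
print(f"All four features:   {score_all:.3f}")
```

The silhouette score is bounded between -1 and 1, with higher values indicating tighter, better separated clusters in the feature space being scored.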