Chapter 1: Introduction to Clustering
Activity 1: Implementing k-means Clustering
Solution:
Load the Iris data file using pandas, a package that makes data wrangling much easier through the use of DataFrames:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import cdist

iris = pd.read_csv('iris_data.csv', header=None)
iris.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm',
                'PetalWidthCm', 'species']
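As a quick sanity check (not part of the original activity), you can confirm that the file loaded as expected; this minimal sketch assumes the CSV contains the standard Iris data with a species column:

# Quick sanity check: number of rows/columns and the distinct species present
print(iris.shape)
print(iris['species'].unique())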
Separate out the X features and the provided y species labels, since we want to treat this as an unsupervised learning problem:
X = iris[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = iris['species']
Get an idea of what our features look like:
X.head()
The output shows the first five rows of the four feature columns.
Bring back the k_means function we made earlier for reference:
def k_means(X, K):
    # Keep track of history so you can see k-means in action
    centroids_history = []
    labels_history = []
    # Pick K distinct rows at random as the initial centroids
    # (replace=False avoids selecting the same point twice)
    rand_index = np.random.choice(X.shape[0], K, replace=False)
    centroids = X[rand_index]
    centroids_history.append(centroids)
    while True:
        # Euclidean distances are calculated for each point relative to the
        # centroids, and then np.argmin returns the index location of the
        # minimal distance - which cluster a point is assigned to
        labels = np.argmin(cdist(X, centroids), axis=1)
        labels_history.append(labels)
        # Take the mean of the points within each cluster to find the new centroids
        new_centroids = np.array([X[labels == i].mean(axis=0)
                                  for i in range(K)])
        centroids_history.append(new_centroids)
        # If the old and new centroids no longer change, k-means has
        # converged and we can stop; otherwise, continue
        if np.all(centroids == new_centroids):
            break
        centroids = new_centroids
    return centroids, labels, centroids_history, labels_history
Convert our Iris X feature DataFrame to a NumPy matrix:
X_mat = X.values
Run our k_means function on the Iris matrix:
centroids, labels, centroids_history, labels_history = k_means(X_mat, 3)
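Since k_means also returns centroids_history, you can watch the centroids move across iterations. This is a minimal sketch (not part of the original activity) that plots each iteration's centroids in the two sepal dimensions:

# Trace how each centroid moved between iterations
for i, cents in enumerate(centroids_history):
    plt.scatter(cents[:, 0], cents[:, 1], label=f'Iteration {i}', alpha=0.7)
plt.title('Centroid Movement - Sepal Length vs Width')
plt.legend()
plt.show()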
See what labels we get by looking at the list of predicted clusters per sample:
print(labels)
The output is an array of predicted cluster labels (0, 1, or 2), one per sample.
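Because k-means cluster IDs are arbitrary, they will not match the species strings directly. As a quick, optional check (not part of the original activity), a cross-tabulation against the provided y labels shows how the clusters line up with the true species:

# Compare cluster assignments with the true species labels
print(pd.crosstab(y, labels))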
Visualize how our k-means implementation performed on the dataset:
plt.scatter(X['SepalLengthCm'], X['SepalWidthCm'])
plt.title('Iris - Sepal Length vs Width')
plt.show()
The output is a scatter plot of sepal length against sepal width, with no cluster coloring.
Visualize the clusters of Iris species as follows:
plt.scatter(X['SepalLengthCm'], X['SepalWidthCm'], c=labels, cmap='tab20b')
plt.title('Iris - Sepal Length vs Width - Clustered')
plt.show()
The output is the same scatter plot, with each point now colored by its predicted cluster.
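To make the final plot easier to read, you could also overlay the converged centroids returned by k_means; a minimal sketch using the first two feature columns:

# Re-plot the clusters with the final centroids marked
plt.scatter(X['SepalLengthCm'], X['SepalWidthCm'], c=labels, cmap='tab20b')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=200)
plt.title('Iris - Sepal Length vs Width - Final Centroids')
plt.show()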
Calculate the Silhouette Score using the scikit-learn implementation:
# Calculate the Silhouette Score on the two plotted features
silhouette_score(X[['SepalLengthCm', 'SepalWidthCm']], labels)
You will get a Silhouette Score roughly equal to 0.369. Since the score here is computed on only two of the four features, this is acceptable when taken together with the visualization of cluster memberships seen in the final plot.
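Note that the score above uses only the two plotted features, while the clustering itself ran on all four. For comparison, you can also score against the full feature set (the resulting value will differ from 0.369):

# Silhouette Score using all four features that k-means actually clustered on
silhouette_score(X, labels)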