Chapter 4: Choosing the Best Segmentation Approach
Activity 5: Determining Clusters for High-End Clothing Customer Data Using the Elbow Method with the Sum of Squared Errors
Read in the data from four_cols.csv:
import pandas as pd

df = pd.read_csv('four_cols.csv')
Inspect the data using the head function:
df.head()
Standardize all columns:
cols = df.columns
zcols = []
for col in cols:
    df['z_' + col] = (df[col] - df[col].mean())/df[col].std()
    zcols.append('z_' + col)
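As an aside, scikit-learn's StandardScaler performs the same z-scoring in one step; here is a minimal sketch, assuming the same df (note that StandardScaler divides by the population standard deviation, ddof=0, while pandas' std() uses ddof=1, so the values differ very slightly):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Returns a NumPy array of z-scored columns, in the same column order
z = scaler.fit_transform(df[cols])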
Plot the data, using dimensionality reduction (principal component analysis):
from sklearn import decomposition
import matplotlib.pyplot as plt
%matplotlib inline

pca = decomposition.PCA(n_components=2)
df['pc1'], df['pc2'] = zip(*pca.fit_transform(df[zcols]))
plt.scatter(df['pc1'], df['pc2'])
plt.show()
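Because the two components are only a projection of the four standardized columns, it is worth checking how much of the variance they retain; the fitted pca object already stores this:

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print('total retained: ' + str(pca.explained_variance_ratio_.sum()))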
Visualize the clusterings produced with two through seven clusters:
from sklearn import cluster

colors = ['r', 'b', 'k', 'g', 'm', 'y', 'c']
markers = ['^', 'o', 'd', 's', 'P', 'X', 'v']
plt.figure(figsize=(12,16))
for n in range(2,8):
    model = cluster.KMeans(n_clusters=n, random_state=10)
    df['cluster'] = model.fit_predict(df[zcols])
    plt.subplot(3, 2, n-1)
    for c in df['cluster'].unique():
        d = df[df['cluster'] == c]
        plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])
plt.show()
Create a plot of the sum of squared errors and look for an elbow:
import numpy as np

ss = []
krange = list(range(2,11))
X = df[zcols].values
for n in krange:
    model = cluster.KMeans(n_clusters=n, random_state=10)
    cluster_assignments = model.fit_predict(X)
    centers = model.cluster_centers_
    ss.append(np.sum((X - centers[cluster_assignments]) ** 2))
plt.plot(krange, ss)
plt.xlabel("$K$")
plt.ylabel("Sum of Squares")
plt.show()
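As an aside, scikit-learn already computes this quantity during fitting and exposes it as inertia_, so the manual NumPy sum above can be replaced; an equivalent sketch using the same krange and X:

ss = []
for n in krange:
    model = cluster.KMeans(n_clusters=n, random_state=10)
    model.fit(X)
    # inertia_ is the sum of squared distances of samples to their
    # closest cluster center -- the same quantity computed manually above
    ss.append(model.inertia_)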
Activity 6: Using Different Clustering Techniques on Customer Behavior Data
Read in the data from customer_offers.csv:
import pandas as pd

df = pd.read_csv('customer_offers.csv').set_index('customer_name')
Use mean-shift clustering (with quantile = 0.1) to cluster the data:
from sklearn import cluster

X = df.to_numpy()  # df.as_matrix() was removed in recent pandas versions
bandwidth = cluster.estimate_bandwidth(X, quantile=0.1, n_samples=500)
ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
df['ms_cluster'] = ms.fit_predict(X)
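Unlike k-means, mean-shift infers the number of clusters from the bandwidth rather than taking it as a parameter, so it is worth checking how many it found; a quick check on the fitted ms object:

# Each row of cluster_centers_ is one mode (cluster) that mean-shift discovered
print('clusters found: ' + str(len(ms.cluster_centers_)))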
Use k-modes clustering (with k=4) to cluster the data:
from kmodes.kmodes import KModes

km = KModes(n_clusters=4)
df['kmode_cluster'] = km.fit_predict(X)
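k-modes represents each cluster by the most frequent value of each column rather than by a mean, which suits binary offer data like this; you can inspect the learned representatives through the kmodes package's cluster_centroids_ attribute:

# Each row is one cluster's mode (most common value) for every offer column
print(km.cluster_centroids_)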
Use k-means clustering (with k=4 and random_state=100) to cluster the data:
model = cluster.KMeans(n_clusters=4, random_state=100)
df['kmean_cluster'] = model.fit_predict(X)
Using dimensionality reduction (principal component analysis), visualize the resulting clustering of each method:
from sklearn import decomposition
import matplotlib.pyplot as plt
%matplotlib inline

colors = ['r', 'b', 'k', 'g']
markers = ['^', 'o', 'd', 's']
pca = decomposition.PCA(n_components=2)
df['pc1'], df['pc2'] = zip(*pca.fit_transform(X))

plt.figure(figsize=(8,12))
ax = plt.subplot(3, 1, 1)
for c in df['ms_cluster'].unique():
    d = df[df['ms_cluster'] == c]
    plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])
ax.set_title('mean-shift')

ax = plt.subplot(3, 1, 2)
for c in df['kmode_cluster'].unique():
    d = df[df['kmode_cluster'] == c]
    plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])
ax.set_title('kmode')

ax = plt.subplot(3, 1, 3)
for c in df['kmean_cluster'].unique():
    d = df[df['kmean_cluster'] == c]
    plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])
ax.set_title('kmean')
plt.show()
Activity 7: Evaluating Clustering on Customer Behavior Data
Import the data from customer_offers.csv:
import pandas as pd

df = pd.read_csv('customer_offers.csv').set_index('customer_name')
Perform a train-test split using random_state = 100:
from sklearn import model_selection

X_train, X_test = model_selection.train_test_split(df, random_state=100)
Note
This is a relatively small dataset, with only 100 data points, so it is pretty sensitive to how the data is split up. When datasets are small like this, it might make sense to use other cross-validation methods, which you can read about here: https://scikit-learn.org/stable/modules/cross_validation.html.
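For illustration, here is a minimal sketch of one such approach, scoring k-means across five folds with scikit-learn's KFold; the fold count and the choice of k=3 here are assumptions for the example, not values prescribed by the activity:

from sklearn import cluster, metrics, model_selection

kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=100)
scores = []
for train_idx, test_idx in kf.split(df):
    fold_train = df.iloc[train_idx]
    fold_test = df.iloc[test_idx]
    model = cluster.KMeans(n_clusters=3, random_state=100)
    model.fit(fold_train)
    labels = model.predict(fold_test)
    scores.append(metrics.silhouette_score(fold_test, labels))
print('mean silhouette across folds: ' + str(sum(scores) / len(scores)))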
Plot the silhouette scores for k-means clustering using k ranging from 2 to 10:
from sklearn import cluster
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline

krange = list(range(2,11))
avg_silhouettes = []
for n in krange:
    model = cluster.KMeans(n_clusters=n, random_state=100)
    model.fit(X_train)
    cluster_assignments = model.predict(X_test)
    silhouette_avg = metrics.silhouette_score(X_test, cluster_assignments)
    avg_silhouettes.append(silhouette_avg)
plt.plot(krange, avg_silhouettes)
plt.xlabel("$K$")
plt.ylabel("Average Silhouette Score")
plt.show()
From the plot, you will observe that the maximum silhouette score is obtained at k=3.
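For reference, a point's silhouette is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster; the plotted score is the average over all points, and values near 1 indicate well-separated clusters. scikit-learn's silhouette_samples exposes the per-point values if you want to look beyond the average:

# Per-point silhouette values for the last clustering computed above
sample_values = metrics.silhouette_samples(X_test, cluster_assignments)
print('min: ' + str(sample_values.min()) + ', max: ' + str(sample_values.max()))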
Use the k found in the previous step, and print out the silhouette score on the test set:
model = cluster.KMeans(n_clusters=3, random_state=100)
model.fit(X_train)
km_labels = model.predict(X_test)
km_silhouette = metrics.silhouette_score(X_test, km_labels)
print('k-means silhouette score: ' + str(km_silhouette))
Perform mean-shift clustering and print out its silhouette score on the test set:
bandwidth = cluster.estimate_bandwidth(X_train, quantile=0.1, n_samples=500)
ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X_train)
ms_labels = ms.predict(X_test)
ms_silhouette = metrics.silhouette_score(X_test, ms_labels)
print('mean-shift silhouette score: ' + str(ms_silhouette))
Perform k-modes clustering and print out its silhouette score on the test set:
from kmodes.kmodes import KModes

km = KModes(n_clusters=4)
km.fit(X_train)
kmode_labels = km.predict(X_test)
kmode_silhouette = metrics.silhouette_score(X_test, kmode_labels)
print('k-mode silhouette score: ' + str(kmode_silhouette))