
You're reading from Data Science for Marketing Analytics: Achieve your marketing goals with the data analytics power of Python

Product type Paperback

Published in Mar 2019

ISBN-13 9781789959413

Length 420 pages

Edition 1st Edition

Languages

Python

Tools

Pandas

Concepts

Data Science

Authors (3):

Tommy Blanchard

Debasish Behera

Pranshu Bhatnagar


Table of Contents (12) Chapters

Data Science for Marketing Analytics

Preface

1. Data Preparation and Cleaning

2. Data Exploration and Visualization

3. Unsupervised Learning: Customer Segmentation

4. Choosing the Best Segmentation Approach

5. Predicting Customer Revenue Using Linear Regression

6. Other Regression Techniques and Tools for Evaluation

7. Supervised Learning: Predicting Customer Churn

8. Fine-Tuning Classification Algorithms

9. Modeling Customer Choice

Appendix

Chapter 4: Choosing the Best Segmentation Approach

Activity 5: Determining Clusters for High-End Clothing Customer Data Using the Elbow Method with the Sum of Squared Errors

Read in the data from four_cols.csv:

import pandas as pd
df = pd.read_csv('four_cols.csv')

Inspect the data using the head function:
```
df.head()
```

Standardize all columns:

cols = df.columns
zcols = []
for col in cols:
  df['z_' + col] = (df[col] - df[col].mean())/df[col].std()
  zcols.append('z_' + col)
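The loop above relies on pandas' `std()`, which computes the sample standard deviation (`ddof=1`). The following is a minimal sketch, assuming a small made-up DataFrame in place of four_cols.csv (the column names `income` and `age` are hypothetical), that writes the same standardization in vectorized form and compares it with scikit-learn's `StandardScaler`, which divides by the population standard deviation (`ddof=0`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for four_cols.csv
toy = pd.DataFrame({'income': [40.0, 55.0, 70.0, 85.0],
                    'age': [25.0, 35.0, 45.0, 65.0]})

# Vectorized equivalent of the loop: pandas std() defaults to ddof=1
z_pandas = (toy - toy.mean()) / toy.std()

# StandardScaler divides by the population std (ddof=0)
z_sklearn = StandardScaler().fit_transform(toy)

# The two results differ only by the constant factor sqrt(n / (n - 1))
n = len(toy)
ratio = np.sqrt(n / (n - 1))
print(np.allclose(z_sklearn, z_pandas.values * ratio))
```

For clustering, the constant factor rescales every column equally, so either convention should lead to the same cluster assignments.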

Plot the data, using dimensionality reduction (principal component analysis):

from sklearn import decomposition
import matplotlib.pyplot as plt
%matplotlib inline

pca = decomposition.PCA(n_components=2)
df['pc1'], df['pc2'] = zip(*pca.fit_transform(df[zcols]))

plt.scatter(df['pc1'], df['pc2'])
plt.show()
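A two-component projection shows only part of the variation in the data, so it can help to check how much variance the two components retain. The sketch below assumes random stand-in data (the DataFrame `demo` and its column names are hypothetical) and reads the fitted model's `explained_variance_ratio_` attribute:

```python
import numpy as np
import pandas as pd
from sklearn import decomposition

# Hypothetical stand-in for the standardized columns
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=['a', 'b', 'c', 'd'])

pca = decomposition.PCA(n_components=2)
components = pca.fit_transform(demo)

# Assign the two components by column instead of unpacking with zip
demo['pc1'] = components[:, 0]
demo['pc2'] = components[:, 1]

# Fraction of total variance captured by each component, and their sum
retained = pca.explained_variance_ratio_.sum()
print(pca.explained_variance_ratio_, retained)
```

If the retained fraction is low, clusters that look overlapping in the scatter plot may still be well separated in the full feature space.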

Visualize clustering with two to seven clusters:

from sklearn import cluster

colors = ['r', 'b', 'k', 'g', 'm', 'y', 'c']
markers = ['^', 'o', 'd', 's', 'P', 'X', 'v']

plt.figure(figsize=(12,16))

for n in range(2,8):
  model = cluster.KMeans(n_clusters=n, random_state=10)
  df['cluster'] = model.fit_predict(df[zcols])

  plt.subplot(3, 2, n-1)
  for c in df['cluster'].unique():
    d = df[df['cluster'] == c]
    plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])

plt.show()

Create a plot of the sum of squared errors and look for an elbow:

import numpy as np

ss = []
krange = list(range(2,11))
X = df[zcols].values
for n in krange:
  model = cluster.KMeans(n_clusters=n, random_state=10)
  model.fit_predict(X)
  cluster_assignments = model.labels_
  centers = model.cluster_centers_
  ss.append(np.sum((X - centers[cluster_assignments]) ** 2))

plt.plot(krange, ss)
plt.xlabel("$K$")
plt.ylabel("Sum of Squares")
plt.show()
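The sum of squared errors computed by hand above is the quantity that scikit-learn's `KMeans` exposes as the `inertia_` attribute after fitting. The sketch below, which assumes synthetic data (`X_demo`) in place of the standardized columns, computes both and checks that they agree:

```python
import numpy as np
from sklearn import cluster

# Synthetic stand-in for the standardized data: three blobs (assumption)
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                    for c in (-3, 0, 3)])

manual_ss = []
inertias = []
for n in range(2, 7):
    model = cluster.KMeans(n_clusters=n, random_state=10, n_init=10)
    labels = model.fit_predict(X_demo)
    centers = model.cluster_centers_
    # Squared distance from each point to its assigned cluster center
    manual_ss.append(np.sum((X_demo - centers[labels]) ** 2))
    # The same quantity as stored by scikit-learn
    inertias.append(model.inertia_)

print(np.allclose(manual_ss, inertias))
```

Using `inertia_` directly would shorten the loop in the activity; the manual version makes the definition explicit.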

Activity 6: Using Different Clustering Techniques on Customer Behavior Data

Read in the data from customer_offers.csv:

import pandas as pd
df = pd.read_csv('customer_offers.csv').set_index('customer_name')

Use mean-shift clustering (with quantile = 0.1) to cluster the data:

from sklearn import cluster

# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
X = df.to_numpy()
bandwidth = cluster.estimate_bandwidth(X, quantile=0.1, n_samples=500)
ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)

df['ms_cluster'] = ms.fit_predict(X)

Use k-modes clustering (with k=4) to cluster the data:

from kmodes.kmodes import KModes

km = KModes(n_clusters=4)
df['kmode_cluster'] = km.fit_predict(X)

Use k-means clustering (with k=4 and random_state=100) to cluster the data:

model = cluster.KMeans(n_clusters=4, random_state=100)
df['kmean_cluster'] = model.fit_predict(X)

Using dimensionality reduction (principal component analysis), visualize the resulting clustering of each method:

from sklearn import decomposition
import matplotlib.pyplot as plt
%matplotlib inline

colors = ['r', 'b', 'k', 'g']
markers = ['^', 'o', 'd', 's']

pca = decomposition.PCA(n_components=2)
df['pc1'], df['pc2'] = zip(*pca.fit_transform(X))

plt.figure(figsize=(8,12))

ax = plt.subplot(3, 1, 1)
for c in df['ms_cluster'].unique():
  d = df[df['ms_cluster'] == c]
  plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])
ax.set_title('mean-shift')

ax = plt.subplot(3, 1, 2)
for c in df['kmode_cluster'].unique():
  d = df[df['kmode_cluster'] == c]
  plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])
ax.set_title('kmode')

ax = plt.subplot(3, 1, 3)
for c in df['kmean_cluster'].unique():
  d = df[df['kmean_cluster'] == c]
  plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])
ax.set_title('kmean')

plt.show()

Activity 7: Evaluating Clustering on Customer Behavior Data

Import the data from customer_offers.csv:

import pandas as pd
df = pd.read_csv('customer_offers.csv').set_index('customer_name')

Perform a train-test split using random_state = 100:
```
from sklearn import model_selection

X_train, X_test = model_selection.train_test_split(df, random_state = 100)
```
Note
This is a relatively small dataset, with only 100 data points, so it is pretty sensitive to how the data is split up. When datasets are small like this, it might make sense to use other cross-validation methods, which you can read about here: https://scikit-learn.org/stable/modules/cross_validation.html.
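As one illustration of the cross-validation idea mentioned in the note, the sketch below averages the silhouette score over five folds instead of relying on a single split. It assumes synthetic 0/1 data (`X_demo`, with made-up group structure) in place of customer_offers.csv, and the choice of five folds and k=3 is arbitrary:

```python
import numpy as np
from sklearn import cluster, metrics, model_selection

# Synthetic stand-in for the customer data (assumption): 90 customers in
# three groups, each group responding mostly to a different block of offers
rng = np.random.RandomState(100)
probs = np.full((3, 12), 0.1)
for g in range(3):
    probs[g, 4 * g:4 * g + 4] = 0.8
groups = np.repeat(np.arange(3), 30)
X_demo = (rng.rand(90, 12) < probs[groups]).astype(float)

kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=100)
fold_scores = []
for train_idx, test_idx in kf.split(X_demo):
    model = cluster.KMeans(n_clusters=3, random_state=100, n_init=10)
    model.fit(X_demo[train_idx])
    # Score the held-out fold using clusters learned on the other folds
    labels = model.predict(X_demo[test_idx])
    fold_scores.append(metrics.silhouette_score(X_demo[test_idx], labels))

print('mean silhouette over folds:', np.mean(fold_scores))
```

Averaging over folds should make the score less dependent on which points happen to land in the test set.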

Plot the silhouette scores for k-means clustering using k ranging from 2 to 10:

from sklearn import cluster
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline

krange = list(range(2,11))
avg_silhouettes = []
for n in krange:
  model = cluster.KMeans(n_clusters=n, random_state=100)
  model.fit(X_train)
  cluster_assignments = model.predict(X_test)
  silhouette_avg = metrics.silhouette_score(X_test, cluster_assignments)
  avg_silhouettes.append(silhouette_avg)

plt.plot(krange, avg_silhouettes)
plt.xlabel("$K$")
plt.ylabel("Average Silhouette Score")
plt.show()

From the plot, you will observe that the maximum silhouette score is obtained at k=3.

Use the k found in the previous step, and print out the silhouette score on the test set:

model = cluster.KMeans(n_clusters=3, random_state=100)
model.fit(X_train)

km_labels = model.predict(X_test)
km_silhouette = metrics.silhouette_score(X_test, km_labels)

print('k-means silhouette score: ' + str(km_silhouette))

Perform mean-shift clustering and print out its silhouette score on the test set:

bandwidth = cluster.estimate_bandwidth(X_train, quantile=0.1, n_samples=500)
ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)

ms.fit(X_train)

ms_labels = ms.predict(X_test)

ms_silhouette = metrics.silhouette_score(X_test, ms_labels)
print('mean-shift silhouette score: ' + str(ms_silhouette))

Perform k-modes clustering and print out its silhouette score on the test set:

from kmodes.kmodes import KModes

km = KModes(n_clusters=4)
km.fit(X_train)

kmode_labels = km.predict(X_test)

kmode_silhouette = metrics.silhouette_score(X_test, kmode_labels)

print('k-mode silhouette score: ' + str(kmode_silhouette))
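By default, `metrics.silhouette_score` measures Euclidean distance, whereas k-modes groups rows by counting mismatched categories. If the features are 0/1 indicators (an assumption about the data, not something stated above), one option is to pass `metric='hamming'` so that the score uses a mismatch-based distance as well. The sketch below uses made-up binary data and hypothetical labels standing in for k-modes output:

```python
import numpy as np
from sklearn import metrics

rng = np.random.RandomState(0)

# Hypothetical binary data: two groups with opposite response patterns
probs = np.array([[0.9] * 4 + [0.1] * 4,
                  [0.1] * 4 + [0.9] * 4])
labels = np.repeat([0, 1], 20)  # stand-in for k-modes cluster labels
X_bin = (rng.rand(40, 8) < probs[labels]).astype(int)

# Default: Euclidean distance between rows
euclid = metrics.silhouette_score(X_bin, labels)

# Hamming distance: fraction of features on which two rows disagree
hamming = metrics.silhouette_score(X_bin, labels, metric='hamming')

print('euclidean:', euclid)
print('hamming:', hamming)
```

Scores computed with different metrics are not directly comparable, so the same metric should be used for every clustering method being compared.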
