Data Science for Marketing Analytics

Achieve your marketing goals with the data analytics power of Python

Product type: Paperback
Published: Mar 2019
ISBN-13: 9781789959413
Length: 420 pages
Edition: 1st Edition
Authors (3): Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar

Table of Contents

Preface
1. Data Preparation and Cleaning
2. Data Exploration and Visualization
3. Unsupervised Learning: Customer Segmentation
4. Choosing the Best Segmentation Approach
5. Predicting Customer Revenue Using Linear Regression
6. Other Regression Techniques and Tools for Evaluation
7. Supervised Learning: Predicting Customer Churn
8. Fine-Tuning Classification Algorithms
9. Modeling Customer Choice
Appendix

Chapter 4: Choosing the Best Segmentation Approach


Activity 5: Determining Clusters for High-End Clothing Customer Data Using the Elbow Method with the Sum of Squared Errors

  1. Read in the data from four_cols.csv:

    import pandas as pd
    df = pd.read_csv('four_cols.csv')
  2. Inspect the data using the head function:

    df.head()
  3. Standardize all columns:

    cols = df.columns
    zcols = []
    for col in cols:
      df['z_' + col] = (df[col] - df[col].mean())/df[col].std()
      zcols.append('z_' + col)
  4. Plot the data, using dimensionality reduction (principal component analysis):

    from sklearn import decomposition
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    pca = decomposition.PCA(n_components=2)
    df['pc1'], df['pc2'] = zip(*pca.fit_transform(df[zcols]))
    
    plt.scatter(df['pc1'], df['pc2'])
    plt.show()
  5. Visualize clustering with two to seven clusters:

    from sklearn import cluster
    
    colors = ['r', 'b', 'k', 'g', 'm', 'y', 'c']
    markers = ['^', 'o', 'd', 's', 'P', 'X', 'v']
    
    plt.figure(figsize=(12,16))
    
    for n in range(2,8):
      model = cluster.KMeans(n_clusters=n, random_state=10)
      df['cluster'] = model.fit_predict(df[zcols])
    
      plt.subplot(3, 2, n-1)
      for c in df['cluster'].unique():
        d = df[df['cluster'] == c]
        plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])
    
    plt.show()
  6. Create a plot of the sum of squared errors and look for an elbow:

    import numpy as np
    
    ss = []
    krange = list(range(2,11))
    X = df[zcols].values
    for n in krange:
      model = cluster.KMeans(n_clusters=n, random_state=10)
      model.fit_predict(X)
      cluster_assignments = model.labels_
      centers = model.cluster_centers_
      ss.append(np.sum((X - centers[cluster_assignments]) ** 2))
    
    plt.plot(krange, ss)
    plt.xlabel("$K$")
    plt.ylabel("Sum of Squares")
    plt.show()
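
The sum of squared errors computed by hand in step 6 is the same quantity that a fitted scikit-learn `KMeans` model exposes as its `inertia_` attribute. The following is a minimal sketch of that equivalence, assuming synthetic data from `make_blobs` as a stand-in for the standardized columns of four_cols.csv; the blob parameters and the explicit `n_init=10` are illustrative choices, not taken from the book.

```python
import numpy as np
from sklearn import cluster
from sklearn.datasets import make_blobs

# Synthetic stand-in for df[zcols].values (assumption: 4 features, 3 blobs).
X, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=10)

krange = list(range(2, 11))
manual_ss = []
inertias = []
for n in krange:
    model = cluster.KMeans(n_clusters=n, n_init=10, random_state=10)
    labels = model.fit_predict(X)
    centers = model.cluster_centers_
    # Same formula as step 6: squared distance of each point to its assigned center.
    manual_ss.append(np.sum((X - centers[labels]) ** 2))
    # Value that scikit-learn stores on the fitted model.
    inertias.append(model.inertia_)

# True when the two lists agree up to floating-point tolerance.
sse_matches_inertia = np.allclose(manual_ss, inertias)
```

If the two agree, `ss.append(model.inertia_)` could replace the manual computation in the loop, and the elbow plot would be unchanged.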

Activity 6: Using Different Clustering Techniques on Customer Behavior Data

  1. Read in the data from customer_offers.csv:

    import pandas as pd
    df = pd.read_csv('customer_offers.csv').set_index('customer_name')
  2. Use mean-shift clustering (with quantile = 0.1) to cluster the data:

    from sklearn import cluster
    
    # DataFrame.as_matrix() was removed in pandas 1.0; .values gives the same NumPy array
    X = df.values
    bandwidth = cluster.estimate_bandwidth(X, quantile=0.1, n_samples=500)
    ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    
    df['ms_cluster'] = ms.fit_predict(X)
  3. Use k-modes clustering (with k=4) to cluster the data:

    from kmodes.kmodes import KModes
    
    km = KModes(n_clusters=4)
    df['kmode_cluster'] = km.fit_predict(X)
  4. Use k-means clustering (with k=4 and random_state=100) to cluster the data:

    model = cluster.KMeans(n_clusters=4, random_state=100)
    df['kmean_cluster'] = model.fit_predict(X)
  5. Using dimensionality reduction (principal component analysis), visualize the resulting clustering of each method:

    from sklearn import decomposition
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    colors = ['r', 'b', 'k', 'g']
    markers = ['^', 'o', 'd', 's']
    
    pca = decomposition.PCA(n_components=2)
    df['pc1'], df['pc2'] = zip(*pca.fit_transform(X))
    
    plt.figure(figsize=(8,12))
    
    ax = plt.subplot(3, 1, 1)
    for c in df['ms_cluster'].unique():
      d = df[df['ms_cluster'] == c]
      plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])
    ax.set_title('mean-shift')
    
    ax = plt.subplot(3, 1, 2)
    for c in df['kmode_cluster'].unique():
      d = df[df['kmode_cluster'] == c]
      plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])
    ax.set_title('kmode')
    
    ax = plt.subplot(3, 1, 3)
    for c in df['kmean_cluster'].unique():
      d = df[df['kmean_cluster'] == c]
      plt.scatter(d['pc1'], d['pc2'], marker=markers[c], color=colors[c])
    ax.set_title('kmean')
    
    plt.show()
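
The plotting code in step 5 indexes `markers[c]` and `colors[c]` directly, and both lists have four entries. Mean-shift chooses the number of clusters itself, so on other data or with another bandwidth it may return more than four labels, in which case the lookup would raise an `IndexError`. One possible guard is sketched below; `style_for` is a hypothetical helper, not part of the book's solution.

```python
colors = ['r', 'b', 'k', 'g']
markers = ['^', 'o', 'd', 's']

def style_for(label):
    """Return a (marker, color) pair for a cluster label (hypothetical helper).

    The color cycles with every label and the marker advances once per full
    color cycle, so the first len(colors) * len(markers) labels are distinct.
    """
    i = int(label)
    color = colors[i % len(colors)]
    marker = markers[(i // len(colors)) % len(markers)]
    return marker, color

# Styles for the first six labels, to show the wrap-around after four colors.
first_styles = [style_for(c) for c in range(6)]
```

Inside each plotting loop, `marker, color = style_for(c)` followed by `plt.scatter(d['pc1'], d['pc2'], marker=marker, color=color)` would then work for up to 16 clusters without repeating a style.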

Activity 7: Evaluating Clustering on Customer Behavior Data

  1. Import the data from customer_offers.csv:

    import pandas as pd
    df = pd.read_csv('customer_offers.csv').set_index('customer_name')
  2. Perform a train-test split using random_state = 100:

    from sklearn import model_selection
    
    X_train, X_test = model_selection.train_test_split(df, random_state = 100)

    Note

    This is a relatively small dataset, with only 100 data points, so the results are quite sensitive to how the data is split. With datasets this small, it can make sense to use other cross-validation methods, which you can read about here: https://scikit-learn.org/stable/modules/cross_validation.html.

  3. Plot the silhouette scores for k-means clustering using k ranging from 2 to 10:

    from sklearn import cluster
    from sklearn import metrics
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    krange = list(range(2,11))
    avg_silhouettes = []
    for n in krange:
      model = cluster.KMeans(n_clusters=n, random_state=100)
      model.fit(X_train)
      cluster_assignments = model.predict(X_test)
      silhouette_avg = metrics.silhouette_score(X_test, cluster_assignments)
      avg_silhouettes.append(silhouette_avg)
    
    plt.plot(krange, avg_silhouettes)
    plt.xlabel("$K$")
    plt.ylabel("Average Silhouette Score")
    plt.show()

    From the plot, you will observe that the maximum silhouette score is obtained at k=3.

  4. Use the k found in the previous step, and print out the silhouette score on the test set:

    model = cluster.KMeans(n_clusters=3, random_state=100)
    model.fit(X_train)
    
    km_labels = model.predict(X_test)
    km_silhouette = metrics.silhouette_score(X_test, km_labels)
    
    print('k-means silhouette score: ' + str(km_silhouette))
  5. Perform mean-shift clustering and print out its silhouette score on the test set:

    bandwidth = cluster.estimate_bandwidth(X_train, quantile=0.1, n_samples=500)
    ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    
    ms.fit(X_train)
    
    ms_labels = ms.predict(X_test)
    
    ms_silhouette = metrics.silhouette_score(X_test, ms_labels)
    print('mean-shift silhouette score: ' + str(ms_silhouette))
  6. Perform k-modes clustering and print out its silhouette score on the test set:

    from kmodes.kmodes import KModes
    
    km = KModes(n_clusters=4)
    km.fit(X_train)
    
    kmode_labels = km.predict(X_test)
    
    kmode_silhouette = metrics.silhouette_score(X_test, kmode_labels)
    
    print('k-mode silhouette score: ' + str(kmode_silhouette))
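
The note in step 2 suggests that cross-validation may be a better fit than a single train-test split for a dataset this small. The sketch below shows one way that could look for the k-means silhouette score, using scikit-learn's `KFold`. It assumes synthetic `make_blobs` data in place of customer_offers.csv, and `cv_silhouette` is a hypothetical helper introduced only for illustration.

```python
import numpy as np
from sklearn import cluster, metrics, model_selection
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer data (assumption: 100 rows, 3 blobs).
X, _ = make_blobs(n_samples=100, n_features=5, centers=3, random_state=100)

def cv_silhouette(X, n_clusters, n_splits=5, random_state=100):
    """Mean held-out silhouette score for k-means across K folds."""
    kf = model_selection.KFold(n_splits=n_splits, shuffle=True,
                               random_state=random_state)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model = cluster.KMeans(n_clusters=n_clusters, n_init=10,
                               random_state=random_state)
        model.fit(X[train_idx])
        labels = model.predict(X[test_idx])
        # silhouette_score needs at least two distinct labels in the fold.
        if len(np.unique(labels)) < 2:
            continue
        scores.append(metrics.silhouette_score(X[test_idx], labels))
    return float(np.mean(scores))

# Average held-out score for each candidate k, and the k with the best average.
cv_scores = {k: cv_silhouette(X, k) for k in range(2, 6)}
best_k = max(cv_scores, key=cv_scores.get)
```

Averaging over folds makes the choice of k depend less on any one split, at the cost of fitting the model `n_splits` times per candidate k.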