Applied Unsupervised Learning with R

Uncover hidden relationships and patterns with k-means clustering, hierarchical clustering, and PCA

Product type: Paperback
Published: Mar 2019
ISBN-13: 9781789956399
Length: 320 pages
Edition: 1st Edition
Authors: Bradford Tuckfield, Alok Malik

Chapter 1: Introduction to Clustering Methods


Activity 1: k-means Clustering with Three Clusters

Solution:

  1. Load the Iris dataset in the iris_data variable:

    iris_data<-iris
  2. Create a t_color column with a default value of red. Then set the color to green for setosa and to blue for virginica, so that the third species, versicolor, remains red:

    iris_data$t_color='red'
    iris_data$t_color[which(iris_data$Species=='setosa')]<-'green'
    iris_data$t_color[which(iris_data$Species=='virginica')]<-'blue'

    Note

    Here, we change the color only for rows whose species is setosa or virginica; all other rows keep the default value.

  3. Choose any three random cluster centers:

    k1<-c(7,3)
    k2<-c(5,3)
    k3<-c(6,2.5)
  4. Draw a scatter plot by passing the sepal length (x) and sepal width (y) to the plot() function, along with the color column. Then mark the three cluster centers with points():

    plot(iris_data$Sepal.Length,iris_data$Sepal.Width,col=iris_data$t_color)
    points(k1[1],k1[2],pch=4)
    points(k2[1],k2[2],pch=5)
    points(k3[1],k3[2],pch=6)

    Here is the output:

    Figure 1.36: Scatter plot for the given cluster centers

  5. Choose a number of iterations:

    number_of_steps<-10
  6. Choose the initial value of the iteration counter, n:

    n<-1
  7. Start the while loop for finding the cluster centers:

    while(n<=number_of_steps){
  8. Calculate the distance of each point from the current cluster centers. We're calculating the Euclidean distance here using the sqrt function:

    iris_data$distance_to_clust1 <- sqrt((iris_data$Sepal.Length-k1[1])^2+(iris_data$Sepal.Width-k1[2])^2)
    iris_data$distance_to_clust2 <- sqrt((iris_data$Sepal.Length-k2[1])^2+(iris_data$Sepal.Width-k2[2])^2)
    iris_data$distance_to_clust3 <- sqrt((iris_data$Sepal.Length-k3[1])^2+(iris_data$Sepal.Width-k3[2])^2)
  9. Assign each point to the cluster whose center is closest. A point that is equally close to two centers goes to the lower-numbered cluster, so every point belongs to exactly one cluster:

      iris_data$clust_1 <- 1*(iris_data$distance_to_clust1<=iris_data$distance_to_clust2 & iris_data$distance_to_clust1<=iris_data$distance_to_clust3)
      iris_data$clust_2 <- 1*(iris_data$distance_to_clust1>iris_data$distance_to_clust2 & iris_data$distance_to_clust3>=iris_data$distance_to_clust2)
      iris_data$clust_3 <- 1*(iris_data$distance_to_clust3<iris_data$distance_to_clust1 & iris_data$distance_to_clust3<iris_data$distance_to_clust2)
  10. Calculate the new cluster centers as the mean x and y coordinates of the points assigned to each cluster, using the mean() function in R. Then increment the counter and close the loop:

      k1[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_1==1)])
      k1[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_1==1)])
      k2[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_2==1)])
      k2[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_2==1)])
      k3[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_3==1)])
      k3[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_3==1)])
      n=n+1
    }
  11. Choose a color for each cluster to use in the scatter plot:

    iris_data$color='red'
    iris_data$color[which(iris_data$clust_2==1)]<-'blue'
    iris_data$color[which(iris_data$clust_3==1)]<-'green'
  12. Plot the final plot:

    plot(iris_data$Sepal.Length,iris_data$Sepal.Width,col=iris_data$color)
    points(k1[1],k1[2],pch=4)
    points(k2[1],k2[2],pch=5)
    points(k3[1],k3[2],pch=6)

    The output is as follows:

    Figure 1.37: Scatter plot representing different species in different colors
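
The loop in steps 7 to 10 can also be written as a small reusable function. The following is a minimal sketch rather than the book's code: simple_kmeans and its arguments are hypothetical names introduced here, the sketch assumes that no cluster ever becomes empty, and ties go to the lower-numbered cluster.

```r
# Hypothetical helper: a basic k-means loop on a numeric matrix.
# x: data (rows = points); centers: matrix of starting centers (rows = centers).
simple_kmeans <- function(x, centers, n_steps = 10) {
  x <- as.matrix(x)
  for (step in seq_len(n_steps)) {
    # Euclidean distance from every point to every center (n x k matrix)
    d <- sapply(seq_len(nrow(centers)), function(j) {
      sqrt(rowSums(sweep(x, 2, centers[j, ])^2))
    })
    # Index of the nearest center; which.min picks the first one on ties
    cluster <- apply(d, 1, which.min)
    # Move each center to the mean of the points assigned to it
    # (assumes every cluster keeps at least one point)
    for (j in seq_len(nrow(centers))) {
      centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}

# Same two features and starting centers as in the activity
sepal <- iris[, c("Sepal.Length", "Sepal.Width")]
start <- rbind(c(7, 3), c(5, 3), c(6, 2.5))
fit <- simple_kmeans(sepal, start)

# Cross-tabulate the cluster labels against the known species
table(fit$cluster, iris$Species)
```

Working on a distance matrix instead of one column per cluster means the same function handles any number of centers without extra code.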

Activity 2: Customer Segmentation with k-means

Solution:

  1. Download the data from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity02/wholesale_customers_data.csv.

  2. Read the data into the ws variable:

    ws<-read.csv('wholesale_customers_data.csv')
  3. Store only columns 5 and 6 in the ws variable by discarding the rest of the columns:

    ws<-ws[5:6]
  4. Import the factoextra library:

    library(factoextra)
  5. Calculate the cluster centers for two clusters:

    clus<-kmeans(ws,2)
  6. Plot the chart for two clusters:

    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.38: Chart for two clusters

    Notice how outliers are also part of the two clusters.

  7. Calculate the cluster centers for three clusters:

    clus<-kmeans(ws,3)
  8. Plot the chart for three clusters:

    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.39: Chart for three clusters

    Notice some outliers are now a part of a separate cluster.

  9. Calculate the cluster centers for four clusters:

    clus<-kmeans(ws,4)
  10. Plot the chart for four clusters:

    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.40: Chart for four clusters

    Notice how the outliers have started separating into two different clusters.

  11. Calculate the cluster centers for five clusters:

    clus<-kmeans(ws,5)
  12. Plot the chart for five clusters:

    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.41: Chart for five clusters

    Notice how the outliers have clearly formed two separate clusters in red and blue, while the rest of the data is grouped into three different clusters.

  13. Calculate the cluster centers for six clusters:

    clus<-kmeans(ws,6)
  14. Plot the chart for six clusters:

    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.42: Chart for six clusters
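
Because kmeans() starts from randomly chosen centers, the cluster assignments and colors can differ from run to run. The sketch below condenses steps 5 to 14 into one loop and makes the result repeatable with set.seed() and nstart. The demo data frame and its column names are synthetic stand-ins invented for this example, not the wholesale data; with the real file loaded, ws would take its place.

```r
# Synthetic stand-in for the two selected columns (an assumption, not the real data)
set.seed(42)
demo <- data.frame(
  spend_a = c(rnorm(100, 3000, 800), rnorm(20, 15000, 3000)),
  spend_b = c(rnorm(100, 2000, 600), rnorm(20, 9000, 2500))
)

# Fit k-means for k = 2..6; nstart = 10 keeps the best of 10 random starts
fits <- lapply(2:6, function(k) kmeans(demo, centers = k, nstart = 10))
names(fits) <- paste0("k", 2:6)

# Cluster sizes and total within-cluster sum of squares for each k
for (name in names(fits)) {
  cat(name, "sizes:", fits[[name]]$size,
      "tot.withinss:", round(fits[[name]]$tot.withinss), "\n")
}
```

With factoextra loaded, a call such as fviz_cluster(fits$k4, data = demo) draws the same kind of chart as in the steps above.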

Activity 3: Performing Customer Segmentation with k-medoids Clustering

Solution:

  1. Read the CSV file into the ws variable:

    ws<-read.csv('wholesale_customers_data.csv')
  2. Store only columns 5 and 6 in the ws variable:

    ws<-ws[5:6]
  3. Import the factoextra library for visualization:

    library(factoextra)
  4. Import the cluster library for clustering by PAM:

    library(cluster)
  5. Calculate clusters by entering data and the number of clusters in the pam function:

    clus<-pam(ws,4)
  6. Plot a visualization of the clusters:

    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.43: K-medoid plot of the clusters

  7. Again, calculate the clusters with k-means and plot the output to compare with the output of the pam clustering:

    clus<-kmeans(ws,4)
    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.44: K-means plot of the clusters
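
One way to see the difference between the two methods without a plot is to inspect the centers directly: k-medoids always uses observed data points as centers, whereas k-means uses averages. The following is a minimal sketch on synthetic data; the demo data frame and its column names are assumptions made for illustration, not the wholesale data.

```r
library(cluster)  # provides pam()

# Synthetic stand-in with two extreme points (an assumption, not the real data)
set.seed(1)
demo <- data.frame(
  spend_a = c(rnorm(60, 3000, 500), rnorm(60, 8000, 700), 40000, 55000),
  spend_b = c(rnorm(60, 2000, 400), rnorm(60, 1000, 300), 30000, 45000)
)

pam_fit <- pam(demo, k = 2)
km_fit  <- kmeans(demo, centers = 2, nstart = 10)

# Medoids are rows of the data set; id.med gives their row numbers
pam_fit$id.med
pam_fit$medoids

# k-means centers are means, so they usually match no actual row
km_fit$centers
```

Comparing pam_fit$medoids with km_fit$centers shows how far extreme values pull each kind of center.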

Activity 4: Finding the Ideal Number of Market Segments

Solution:

  1. Read the downloaded dataset into the ws variable:

    ws<-read.csv('wholesale_customers_data.csv')
  2. Store only columns 5 and 6 in the variable by discarding other columns:

    ws<-ws[5:6]
  3. Load the factoextra library if it is not already loaded, and calculate the optimal number of clusters with the silhouette score:

    library(factoextra)
    fviz_nbclust(ws, kmeans, method = "silhouette",k.max=20)

    Here is the output:

    Figure 1.45: Graph representing optimal number of clusters with the silhouette score

    The optimal number of clusters, according to the silhouette score, is two.

  4. Calculate the optimal number of clusters with the WSS score:

    fviz_nbclust(ws, kmeans, method = "wss", k.max=20)

    Here is the output:

    Figure 1.46: Optimal number of clusters with the WSS score

    The optimal number of clusters according to the WSS elbow method is around six.

  5. Calculate the optimal number of clusters with the Gap statistic:

    fviz_nbclust(ws, kmeans, method = "gap_stat",k.max=20)

    Here is the output:

    Figure 1.47: Optimal number of clusters with the Gap statistic

    The optimal number of clusters according to the Gap statistic is one.
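
To make the numbers behind the silhouette and WSS plots visible, the scores can be computed by hand for each k. This is a minimal sketch on synthetic data, assuming the cluster package is installed; demo, ks, and scores are hypothetical names introduced here, and the sketch is not a description of how fviz_nbclust works internally.

```r
library(cluster)  # provides silhouette()

# Synthetic stand-in data with three groups (an assumption, not the real data)
set.seed(7)
demo <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

ks <- 2:8
d <- dist(demo)  # pairwise Euclidean distances, reused for every k

# One row per k: average silhouette width and total within-cluster sum of squares
scores <- t(sapply(ks, function(k) {
  km <- kmeans(demo, centers = k, nstart = 10)
  sil <- silhouette(km$cluster, d)
  c(k = k, avg_silhouette = mean(sil[, "sil_width"]), wss = km$tot.withinss)
}))
scores

# k with the highest average silhouette width
best_k <- scores[which.max(scores[, "avg_silhouette"]), "k"]
best_k
```

The silhouette criterion picks the maximum of its column, while the WSS column is read for an elbow, which is why the two methods can suggest different values of k, as they do in this activity.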
