Chapter 1: Introduction to Clustering Methods
Activity 1: k-means Clustering with Three Clusters
Solution:
Load the Iris dataset in the iris_data variable:
iris_data<-iris
Create a t_color column with a default value of red. Then set it to green for setosa and blue for virginica, so that the remaining species (versicolor) stays red:
iris_data$t_color<-'red'
iris_data$t_color[which(iris_data$Species=='setosa')]<-'green'
iris_data$t_color[which(iris_data$Species=='virginica')]<-'blue'
Note
Here, we change the color only for rows whose species is setosa or virginica.
Choose any three random cluster centers:
k1<-c(7,3)
k2<-c(5,3)
k3<-c(6,2.5)
Create a scatter plot of sepal length against sepal width with the plot() function, coloring the points by species, and mark the three initial cluster centers with points():
plot(iris_data$Sepal.Length,iris_data$Sepal.Width,col=iris_data$t_color)
points(k1[1],k1[2],pch=4)
points(k2[1],k2[2],pch=5)
points(k3[1],k3[2],pch=6)
Here is the output:
Figure 1.36: Scatter plot for the given cluster centers
Choose a number of iterations:
number_of_steps<-10
Choose the initial value of n:
n<-1
Start the while loop for finding the cluster centers:
while(n<=number_of_steps){
Calculate the distance of each point from the current cluster centers. We're calculating the Euclidean distance here using the sqrt function:
iris_data$distance_to_clust1 <- sqrt((iris_data$Sepal.Length-k1[1])^2+(iris_data$Sepal.Width-k1[2])^2)
iris_data$distance_to_clust2 <- sqrt((iris_data$Sepal.Length-k2[1])^2+(iris_data$Sepal.Width-k2[2])^2)
iris_data$distance_to_clust3 <- sqrt((iris_data$Sepal.Length-k3[1])^2+(iris_data$Sepal.Width-k3[2])^2)
Assign each point to the cluster whose center is closest:
iris_data$clust_1 <- 1*(iris_data$distance_to_clust1<=iris_data$distance_to_clust2 &
  iris_data$distance_to_clust1<=iris_data$distance_to_clust3)
# >= in the second comparison so that a point tied between centers 2 and 3 is still assigned
iris_data$clust_2 <- 1*(iris_data$distance_to_clust1>iris_data$distance_to_clust2 &
  iris_data$distance_to_clust3>=iris_data$distance_to_clust2)
iris_data$clust_3 <- 1*(iris_data$distance_to_clust3<iris_data$distance_to_clust1 &
  iris_data$distance_to_clust3<iris_data$distance_to_clust2)
Calculate the new cluster centers by taking the mean x and y coordinates of the points assigned to each cluster with the mean() function, then increment the counter and close the loop:
k1[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_1==1)])
k1[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_1==1)])
k2[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_2==1)])
k2[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_2==1)])
k3[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_3==1)])
k3[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_3==1)])
n<-n+1
}
Choose a color for each cluster for the scatter plot:
iris_data$color<-'red'
iris_data$color[which(iris_data$clust_2==1)]<-'blue'
iris_data$color[which(iris_data$clust_3==1)]<-'green'
Plot the final clusters together with their centers:
plot(iris_data$Sepal.Length,iris_data$Sepal.Width,col=iris_data$color)
points(k1[1],k1[2],pch=4)
points(k2[1],k2[2],pch=5)
points(k3[1],k3[2],pch=6)
The output is as follows:
Figure 1.37: Scatter plot representing different species in different colors
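The manual loop in this activity can also be written more compactly. The following is a minimal sketch, not the solution given above: it keeps the three centers in a matrix, assigns each point to its nearest center, and stops early once the centers no longer move. The helper name simple_kmeans and its max_steps and tol arguments are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: a compact version of the manual k-means loop.
simple_kmeans <- function(x, centers, max_steps = 10, tol = 1e-8) {
  x <- as.matrix(x)
  for (step in seq_len(max_steps)) {
    # Squared Euclidean distance from every point to every center
    # (rows = points, columns = centers).
    d <- sapply(seq_len(nrow(centers)), function(j) {
      rowSums(sweep(x, 2, centers[j, ])^2)
    })
    # Index of the nearest center; ties go to the lowest index.
    cluster <- max.col(-d, ties.method = "first")
    # New center = mean of the points assigned to that cluster.
    new_centers <- centers
    for (j in seq_len(nrow(centers))) {
      if (any(cluster == j)) {
        new_centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
    moved <- max(abs(new_centers - centers))
    centers <- new_centers
    if (moved < tol) break  # centers stopped moving
  }
  list(centers = centers, cluster = cluster, steps = step)
}

xy <- iris[, c("Sepal.Length", "Sepal.Width")]
start <- rbind(c(7, 3), c(5, 3), c(6, 2.5))  # same starting centers as above
fit <- simple_kmeans(xy, start)
print(fit$centers)
print(table(fit$cluster, iris$Species))
```

Comparing squared distances gives the same nearest center as comparing Euclidean distances, so the sqrt() call can be skipped. The cross-tabulation at the end shows how the clusters line up with the species labels.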
Activity 2: Customer Segmentation with k-means
Solution:
Download the data from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity02/wholesale_customers_data.csv.
Read the data into the ws variable:
ws<-read.csv('wholesale_customers_data.csv')
Store only columns 5 and 6 in the ws variable by discarding the rest of the columns:
ws<-ws[5:6]
Import the factoextra library:
library(factoextra)
Calculate the cluster centers for two clusters:
clus<-kmeans(ws,2)
Plot the chart for two clusters:
fviz_cluster(clus,data=ws)
The output is as follows:
Figure 1.38: Chart for two clusters
Notice how outliers are also part of the two clusters.
Calculate the cluster centers for three clusters:
clus<-kmeans(ws,3)
Plot the chart for three clusters:
fviz_cluster(clus,data=ws)
The output is as follows:
Figure 1.39: Chart for three clusters
Notice some outliers are now a part of a separate cluster.
Calculate the cluster centers for four clusters:
clus<-kmeans(ws,4)
Plot the chart for four clusters:
fviz_cluster(clus,data=ws)
The output is as follows:
Figure 1.40: Chart for four clusters
Notice how the outliers have started to separate into two different clusters.
Calculate the cluster centers for five clusters:
clus<-kmeans(ws,5)
Plot the chart for five clusters:
fviz_cluster(clus,data=ws)
The output is as follows:
Figure 1.41: Chart for five clusters
Notice how the outliers have clearly formed two separate clusters in red and blue, while the rest of the data is grouped into three different clusters.
Calculate the cluster centers for six clusters:
clus<-kmeans(ws,6)
Plot the chart for six clusters:
fviz_cluster(clus,data=ws)
The output is as follows:
Figure 1.42: Chart for six clusters
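Because kmeans() starts from randomly chosen centers, repeated runs of the steps in this activity can produce different clusters and colors than the figures show. The following is a minimal sketch of one way to make the runs repeatable and to avoid retyping the call for each value of k. The data frame ws_demo and its column names a and b are hypothetical stand-ins for the two-column ws data, and the seed and nstart values are arbitrary choices.

```r
set.seed(1)  # fix the random number generator so results are repeatable

# Hypothetical stand-in for ws: a dense bulk of small values plus three outliers.
ws_demo <- data.frame(
  a = c(rexp(200, rate = 1 / 5000), 60000, 75000, 90000),
  b = c(rexp(200, rate = 1 / 2000), 30000, 5000, 40000)
)

# Fit k-means for k = 2, ..., 6. nstart = 25 reruns the algorithm from
# 25 random starts and keeps the fit with the lowest within-cluster sum of squares.
fits <- lapply(2:6, function(k) kmeans(ws_demo, centers = k, nstart = 25))

for (fit in fits) {
  cat("k =", length(fit$size),
      "cluster sizes:", sort(fit$size, decreasing = TRUE), "\n")
}
```

Very small clusters in the printed sizes correspond to groups of outliers. To draw the charts inside such a loop, wrap the plotting call in print(), for example print(fviz_cluster(fit, data = ws_demo)), since plot objects are not displayed automatically inside a loop.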
Activity 3: Performing Customer Segmentation with k-medoids Clustering
Solution:
Read the CSV file into the ws variable:
ws<-read.csv('wholesale_customers_data.csv')
Store only columns 5 and 6 in the ws variable:
ws<-ws[5:6]
Import the factoextra library for visualization:
library(factoextra)
Import the cluster library for clustering by PAM:
library(cluster)
Calculate the clusters by passing the data and the number of clusters to the pam() function:
clus<-pam(ws,4)
Plot a visualization of the clusters:
fviz_cluster(clus,data=ws)
The output is as follows:
Figure 1.43: K-medoid plot of the clusters
Again, calculate the clusters with k-means and plot the output to compare with the output of the pam clustering:
clus<-kmeans(ws,4)
fviz_cluster(clus,data=ws)
The output is as follows:
Figure 1.44: K-means plot of the clusters
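The difference between the two plots comes from what each method uses as a cluster center: pam() picks actual observations (medoids), whereas kmeans() uses cluster means. The following is a minimal sketch of how to inspect this. The data frame demo is a hypothetical stand-in for ws, with two groups and one extreme point, and the choice of two clusters is arbitrary.

```r
library(cluster)  # provides pam()

set.seed(1)
# Hypothetical stand-in data: two groups plus one extreme point.
demo <- data.frame(
  a = c(rnorm(50, mean = 10), rnorm(50, mean = 20), 100),
  b = c(rnorm(50, mean = 10), rnorm(50, mean = 20), 100)
)

pam_fit <- pam(demo, k = 2)
km_fit <- kmeans(demo, centers = 2, nstart = 25)

# Medoids are rows of the data; id.med holds their row indices.
print(pam_fit$medoids)
print(pam_fit$id.med)
medoid_rows <- demo[pam_fit$id.med, ]

# k-means centers are cluster means and usually match no single observation.
print(km_fit$centers)

# Compare how the two methods split the observations.
print(table(pam = pam_fit$clustering, kmeans = km_fit$cluster))
```

Because a medoid must be one of the observations, a single extreme point generally has less pull on it than on a mean.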
Activity 4: Finding the Ideal Number of Market Segments
Solution:
Read the downloaded dataset into the ws variable:
ws<-read.csv('wholesale_customers_data.csv')
Store only columns 5 and 6 in the variable by discarding other columns:
ws<-ws[5:6]
Calculate the optimal number of clusters with the silhouette score:
library(factoextra)
fviz_nbclust(ws, kmeans, method = "silhouette",k.max=20)
Here is the output:
Figure 1.45: Graph representing optimal number of clusters with the silhouette score
The optimal number of clusters, according to the silhouette score, is two.
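To see what the silhouette plot summarizes, the average silhouette width can be computed directly for each k. The following is a minimal sketch using silhouette() from the cluster package. The matrix demo is a hypothetical stand-in for ws, and the range of k and the nstart value are arbitrary choices.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Hypothetical stand-in data: two well-separated groups in two dimensions.
demo <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)

d <- dist(demo)  # pairwise Euclidean distances, computed once
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(demo, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])  # average silhouette width for this k
})
names(avg_sil) <- 2:6
print(round(avg_sil, 3))

# The k with the largest average silhouette width is the suggested choice.
best_k <- as.integer(names(which.max(avg_sil)))
print(best_k)
```

The silhouette width is undefined for a single cluster, which is why the sketch starts at k = 2.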
Calculate the optimal number of clusters with the WSS score:
fviz_nbclust(ws, kmeans, method = "wss", k.max=20)
Here is the output:
Figure 1.46: Optimal number of clusters with the WSS score
The optimal number of clusters according to the WSS elbow method is around six.
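The WSS curve can also be computed directly from the tot.withinss component that kmeans() returns. The following is a minimal sketch in base R. The matrix demo is a hypothetical stand-in for ws, and the range of k and the nstart value are arbitrary choices.

```r
set.seed(1)
# Hypothetical stand-in data: three groups in two dimensions.
demo <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

ks <- 1:10
# Total within-cluster sum of squares for each k.
wss <- sapply(ks, function(k) {
  kmeans(demo, centers = k, nstart = 25)$tot.withinss
})

print(data.frame(k = ks, wss = round(wss, 1)))
```

Calling plot(ks, wss, type = "b") draws the curve. The elbow is the point after which adding another cluster reduces the WSS only slightly, so reading it off is a judgment call rather than an exact rule.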
Calculate the optimal number of clusters with the Gap statistic:
fviz_nbclust(ws, kmeans, method = "gap_stat",k.max=20)
Here is the output:
Figure 1.47: Optimal number of clusters with the Gap statistic
The optimal number of clusters according to the Gap statistic is one.
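The Gap statistic can also be computed directly with clusGap() from the cluster package, which exposes the values behind the plot. The following is a minimal sketch. The matrix demo is a hypothetical stand-in for ws, and the values of K.max, B, and nstart, as well as the "firstSEmax" selection rule, are assumptions made for illustration.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(1)
# Hypothetical stand-in data: two well-separated groups in two dimensions.
demo <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)

# B is the number of reference data sets simulated for comparison;
# nstart is passed through to kmeans().
gap <- clusGap(demo, FUNcluster = kmeans, nstart = 25, K.max = 6, B = 50)
print(gap$Tab)  # columns: logW, E.logW, gap, SE.sim

# Pick k from the gap values and their standard errors.
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
print(best_k)
```

Unlike the silhouette score, the Gap statistic is defined for k = 1, so it can report that the data shows no evidence of more than one cluster.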