Chapter 1: Introduction to Clustering Methods
Activity 1: k-means Clustering with Three Clusters
Solution:
Load the built-in Iris dataset into the iris_data variable:
iris_data<-iris
Create a t_color column with a default value of red. Change the color for two of the species to green and blue so that the third species remains red:
iris_data$t_color<-'red'
iris_data$t_color[which(iris_data$Species=='setosa')]<-'green'
iris_data$t_color[which(iris_data$Species=='virginica')]<-'blue'
Note
Here, we change the color column only for those rows whose species is setosa or virginica.
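As a quick optional check that is not part of the original activity, you can cross-tabulate the species against the new color column to confirm the assignment:
# Optional check: each species should map to exactly one color
table(iris_data$Species, iris_data$t_color)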
Choose any three points as the initial cluster centers:
k1<-c(7,3)
k2<-c(5,3)
k3<-c(6,2.5)
Create a scatterplot by passing the sepal length and sepal width to the plot() function, along with the color column, and mark the three cluster centers with points():
plot(iris_data$Sepal.Length,iris_data$Sepal.Width,col=iris_data$t_color)
points(k1[1],k1[2],pch=4)
points(k2[1],k2[2],pch=5)
points(k3[1],k3[2],pch=6)
Here is the output:
Choose a number of iterations:
number_of_steps<-10
Choose an initial value for n:
n<-1
Start the while loop for finding the cluster centers:
while(n<number_of_steps){
Calculate the distance of each point from the current cluster centers. We're calculating the Euclidean distance here using the sqrt function:
iris_data$distance_to_clust1 <- sqrt((iris_data$Sepal.Length-k1[1])^2+(iris_data$Sepal.Width-k1[2])^2)
iris_data$distance_to_clust2 <- sqrt((iris_data$Sepal.Length-k2[1])^2+(iris_data$Sepal.Width-k2[2])^2)
iris_data$distance_to_clust3 <- sqrt((iris_data$Sepal.Length-k3[1])^2+(iris_data$Sepal.Width-k3[2])^2)
Assign each point to the cluster whose center it is closest to:
iris_data$clust_1 <- 1*(iris_data$distance_to_clust1<=iris_data$distance_to_clust2 & iris_data$distance_to_clust1<=iris_data$distance_to_clust3)
iris_data$clust_2 <- 1*(iris_data$distance_to_clust1>iris_data$distance_to_clust2 & iris_data$distance_to_clust3>iris_data$distance_to_clust2)
iris_data$clust_3 <- 1*(iris_data$distance_to_clust3<iris_data$distance_to_clust1 & iris_data$distance_to_clust3<iris_data$distance_to_clust2)
Calculate the new cluster centers by taking the mean x and y coordinates of the points assigned to each cluster, using the mean() function in R. Then increment n and close the loop:
k1[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_1==1)])
k1[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_1==1)])
k2[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_2==1)])
k2[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_2==1)])
k3[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_3==1)])
k3[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_3==1)])
n<-n+1
}
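If you want to inspect the result numerically, a quick optional check (not part of the original activity) is to print the final cluster centers once the loop finishes:
# Optional check: the (x, y) coordinates of the three final cluster centers
print(k1)
print(k2)
print(k3)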
Choose a color for each cluster in order to plot a scatterplot:
iris_data$color<-'red'
iris_data$color[which(iris_data$clust_2==1)]<-'blue'
iris_data$color[which(iris_data$clust_3==1)]<-'green'
Plot the final plot:
plot(iris_data$Sepal.Length,iris_data$Sepal.Width,col=iris_data$color)
points(k1[1],k1[2],pch=4)
points(k2[1],k2[2],pch=5)
points(k3[1],k3[2],pch=6)
The output is as follows:
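As an optional comparison that is not part of the original activity, you can run R's built-in kmeans() function on the same two columns and compare its centers with the ones found by the manual loop; the variable name builtin is only illustrative, and the exact values may differ slightly because kmeans() picks random starting centers:
# Optional comparison with the built-in implementation (stats::kmeans)
builtin <- kmeans(iris_data[, c("Sepal.Length", "Sepal.Width")], centers = 3)
builtin$centers   # compare with k1, k2, and k3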
Activity 2: Customer Segmentation with k-means
Solution:
Download the data from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity02/wholesale_customers_data.csv.
Read the data into the ws variable:
ws<-read.csv('wholesale_customers_data.csv')
Store only columns 5 and 6 in the ws variable, discarding the rest of the columns:
ws<-ws[5:6]
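Optionally, you can confirm which columns were kept; in the standard UCI wholesale customers dataset, columns 5 and 6 should correspond to the Grocery and Frozen annual spending figures:
# Optional check of the retained columns (expected: Grocery and Frozen)
colnames(ws)
head(ws)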
Import the factoextra library:
library(factoextra)
Calculate the cluster centers for two clusters:
clus<-kmeans(ws,2)
Plot the chart for two clusters:
fviz_cluster(clus,data=ws)
The output is as follows:
Notice how outliers are also part of the two clusters.
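Besides the plot, the object returned by kmeans() also stores the numeric centers and cluster sizes, which you can optionally inspect at any of the following steps:
# Optional check: numeric cluster centers and number of customers per cluster
clus$centers
clus$size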
Calculate the cluster centers for three clusters:
clus<-kmeans(ws,3)
Plot the chart for three clusters:
fviz_cluster(clus,data=ws)
The output is as follows:
Notice that some outliers are now part of a separate cluster.
Calculate the cluster centers for four clusters:
clus<-kmeans(ws,4)
Plot the chart for four clusters:
fviz_cluster(clus,data=ws)
The output is as follows:
Notice how the outliers have started separating into two different clusters.
Calculate the cluster centers for five clusters:
clus<-kmeans(ws,5)
Plot the chart for five clusters:
fviz_cluster(clus,data=ws)
The output is as follows:
Notice how the outliers have clearly formed two separate clusters in red and blue, while the rest of the data is grouped into three other clusters.
Calculate the cluster centers for six clusters:
clus<-kmeans(ws,6)
Plot the chart for six clusters:
fviz_cluster(clus,data=ws)
The output is as follows:
Activity 3: Performing Customer Segmentation with k-medoids Clustering
Solution:
Read the CSV file into the ws variable:
ws<-read.csv('wholesale_customers_data.csv')
Store only columns 5 and 6 in the ws variable:
ws<-ws[5:6]
Import the factoextra library for visualization:
library(factoextra)
Import the cluster library for clustering by PAM:
library(cluster)
Calculate the clusters by passing the data and the number of clusters to the pam() function:
clus<-pam(ws,4)
Plot a visualization of the clusters:
fviz_cluster(clus,data=ws)
The output is as follows:
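Unlike kmeans(), pam() chooses actual data points as cluster centers (medoids). As an optional check that is not part of the original solution, you can inspect the medoids and the cluster sizes:
# Optional check: the medoids are real observations from the dataset
clus$medoids
table(clus$clustering)   # number of customers per cluster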
Calculate the clusters with k-means again and plot the output to compare it with the output of the PAM clustering:
clus<-kmeans(ws,4)
fviz_cluster(clus,data=ws)
The output is as follows:
Activity 4: Finding the Ideal Number of Market Segments
Solution:
Read the downloaded dataset into the ws variable:
ws<-read.csv('wholesale_customers_data.csv')
Store only columns 5 and 6 in the ws variable, discarding the other columns:
ws<-ws[5:6]
Import the factoextra library (if it isn't already loaded) and calculate the optimal number of clusters with the silhouette score:
library(factoextra)
fviz_nbclust(ws, kmeans, method = "silhouette", k.max=20)
Here is the output:
The optimal number of clusters, according to the silhouette score, is two.
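For reference, the average silhouette width that this curve is based on can also be computed directly with the silhouette() function from the cluster package. This is an optional sketch, not part of the original solution, and the helper names km and sil are illustrative:
# Optional: average silhouette width for k = 2, computed directly
library(cluster)
km <- kmeans(ws, 2)
sil <- silhouette(km$cluster, dist(ws))
mean(sil[, "sil_width"])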
Calculate the optimal number of clusters with the WSS score:
fviz_nbclust(ws, kmeans, method = "wss", k.max=20)
Here is the output:
The optimum number of clusters according to the WSS elbow method is around six.
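The same elbow curve can be reproduced by hand, since the object returned by kmeans() contains the total within-cluster sum of squares. This is an optional sketch, not part of the original solution, using an illustrative helper variable named wss:
# Optional: total within-cluster sum of squares for k = 1 to 20
wss <- sapply(1:20, function(k) kmeans(ws, k, nstart = 10)$tot.withinss)
plot(1:20, wss, type = "b", xlab = "Number of clusters k", ylab = "Total WSS")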
Calculate the optimal number of clusters with the Gap statistic:
fviz_nbclust(ws, kmeans, method = "gap_stat",k.max=20)
Here is the output:
The optimal number of clusters according to the Gap statistic is one.
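The gap statistic can also be computed directly with the clusGap() function from the cluster package. This is an optional sketch, not part of the original solution; the value of B (the number of bootstrap reference sets) is an assumption you can adjust:
# Optional: gap statistic for k = 1 to 20 with 50 bootstrap reference sets
library(cluster)
gap <- clusGap(ws, FUNcluster = kmeans, K.max = 20, B = 50)
plot(gap)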