Introduction to k-means Clustering
K-means clustering is one of the most basic unsupervised learning algorithms. It finds natural groupings in the data according to a predefined similarity or distance measure. The distance measure can be any of the following:
Euclidean distance
Manhattan distance
Cosine distance
Hamming distance
To understand what a distance measure does, take the example of a bunch of pens. You have 12 pens. Six of them are blue, and six are red. Six of them are ballpoint pens, and six are ink pens. If you were to use ink color as the similarity measure, then the six blue pens and the six red pens would end up in different clusters. The six blue pens could be ink or ballpoint here; there's no restriction on that. But if you were to use the type of pen as the similarity measure, then the six ink pens and the six ballpoint pens would be in different clusters. Now it doesn't matter whether the pens in each cluster are of the same color or not.
Euclidean Distance
Euclidean distance is the straight-line distance between any two points. Calculation of this distance in two dimensions can be thought of as an extension of the Pythagorean theorem, which you might have studied in school. However, Euclidean distance can be calculated between two points in any n-dimensional space, not just a two-dimensional space. The Euclidean distance between any two points is the square root of the sum of the squares of the differences between their coordinates. An example of the calculation of Euclidean distance is presented here:
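For instance, the following short R snippet is a minimal sketch of this calculation; the two points p and q are made up purely for illustration:

# Two example points in three-dimensional space
p <- c(1, 2, 3)
q <- c(4, 6, 3)

# Square the differences between coordinates, sum them, and take the square root
euclidean_distance <- sqrt(sum((p - q)^2))
euclidean_distance  # 5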
In k-means clustering, Euclidean distance is used. One disadvantage of using Euclidean distance is that it loses its meaning when the dimensionality of the data is very high. This is related to a phenomenon known as the curse of dimensionality. When a dataset has many dimensions, the distances between points tend to become large and nearly equal to one another, so they carry little information and are difficult to interpret and visualize.
So, when the dimensionality of data is very high, either we reduce its dimensions with principal component analysis, which we're going to study in Chapter 4, Dimension Reduction, or we use cosine similarity.
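If you want to see this effect for yourself, the following sketch compares pairwise Euclidean distances for random points in 2 dimensions and in 1,000 dimensions (the sample size and the number of dimensions are arbitrary choices for illustration). In the high-dimensional case, the largest and smallest pairwise distances are nearly the same, so the distances no longer discriminate well between points:

set.seed(42)

# 100 random points in 2 dimensions and 100 random points in 1,000 dimensions
low_dim  <- matrix(runif(100 * 2),    nrow = 100)
high_dim <- matrix(runif(100 * 1000), nrow = 100)

# Ratio of the largest to the smallest pairwise Euclidean distance
distance_ratio <- function(m) { d <- dist(m); max(d) / min(d) }
distance_ratio(low_dim)   # large ratio: distances are informative
distance_ratio(high_dim)  # ratio close to 1: distances look almost identical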
Manhattan Distance
By definition, Manhattan distance is the distance between two points measured along axes at right angles, rather than along the straight line that joins them:
The length of the diagonal line is the Euclidean distance between the two points. Manhattan distance is simply the sum of the absolute values of the differences between the coordinates. So, the main difference between Euclidean distance and Manhattan distance is that with Euclidean distance, we square the differences between the coordinates and then take the square root of the sum, whereas with Manhattan distance, we directly sum the absolute values of the differences between the coordinates.
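As a quick illustration, the Manhattan distance between two made-up points can be computed in R by summing the absolute differences, and compared directly with the Euclidean distance between the same points:

# Two example points in two-dimensional space
p <- c(1, 2)
q <- c(4, 6)

manhattan_distance <- sum(abs(p - q))       # |1-4| + |2-6| = 7
euclidean_distance <- sqrt(sum((p - q)^2))  # sqrt(9 + 16) = 5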
Cosine Distance
Cosine similarity between any two points is defined as the cosine of the angle formed at the origin between the vectors pointing to those two points. It can be calculated by dividing the dot product of the two vectors by the product of their magnitudes:
Cosine distance is defined as (1-cosine similarity).
Cosine distance varies from 0 to 2, whereas cosine similarity varies between -1 and 1. Always remember that cosine similarity is one minus the cosine distance.
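The following is a small sketch of this calculation in R; the two vectors are arbitrary examples chosen so that they point in the same direction:

a <- c(1, 2, 3)
b <- c(2, 4, 6)

# Cosine similarity: dot product divided by the product of the magnitudes
cosine_similarity <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine_distance   <- 1 - cosine_similarity

cosine_similarity  # 1, because b points in the same direction as a
cosine_distance    # 0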
Hamming Distance
The Hamming distance is a special type of distance that is used for categorical variables. Given two points of equal dimensions, the Hamming distance is defined as the number of coordinates in which they differ. For example, let's take two points, (0, 1, 1) and (0, 1, 0). Only one coordinate, the last one, differs between these two points. As such, the Hamming distance between them is 1:
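In R, a quick way to reproduce this calculation for the two points from the example is to count the coordinates that differ:

x <- c(0, 1, 1)
y <- c(0, 1, 0)

# Count the coordinates in which the two points differ
hamming_distance <- sum(x != y)
hamming_distance  # 1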
k-means Clustering Algorithm
K-means clustering is used to find clusters of similar points in a dataset when we have unlabeled data. In this chapter, we're going to use the Iris flowers dataset. This dataset contains information about the length and width of the sepals and petals of flowers of different species. With the help of unsupervised learning, we're going to learn how to differentiate between them without knowing which properties belong to which species. The following is a scatter plot of our dataset:
This is the scatter plot of two variables of the Iris flower dataset: sepal length and sepal width.
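If you would like to reproduce a plain, uncolored version of this scatter plot yourself, a one-line sketch using R's built-in iris dataset is:

# Sepal length on the x-axis, sepal width on the y-axis
plot(iris$Sepal.Length, iris$Sepal.Width, xlab = 'Sepal Length', ylab = 'Sepal Width')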
If we were to identify clusters in the preceding dataset according to the distance between the points, we would choose clusters that look like bunches of grapes hanging from a vine. You can see that there are two major bunches: one in the top left, and the other consisting of the remaining points. The k-means algorithm identifies these "bunches" of grapes.
The following figure shows the same scatter plot, but with the three different species of Iris shown in different colors. These species are taken from the 'species' column of the original dataset, and are as follows: Iris setosa (shown in green), Iris versicolor (shown in red), and Iris virginica (shown in blue). We're going to see whether we can determine these species by forming our own classifications using clustering:
Here is a photo of Iris setosa, which is represented in green in the preceding scatter plot:
The following is a photo of Iris versicolor, which is represented in red in the preceding scatter plot:
Here is a photo of Iris virginica, which is represented in blue in the preceding scatter plot:
Steps to Implement k-means Clustering
As we saw in the scatter plot in Figure 1.9, each data point represents a flower. We're going to find clusters that identify these species. To do this type of clustering, we're going to use k-means clustering, where k is the number of clusters we want. The following are the steps to perform k-means clustering, which, for simplicity of understanding, we're going to demonstrate with two clusters. We will build up to using three clusters later, in order to try to match the actual species groupings:
Choose any two random coordinates, k1 and k2, on the scatter plot as initial cluster centers.
Calculate the distance of each data point in the scatter plot from coordinates k1 and k2.
Assign each data point to a cluster based on whether it is closer to k1 or k2.
Find the mean coordinates of all points in each cluster and update the values of k1 and k2 to those coordinates respectively.
Start again from Step 2 until the coordinates of k1 and k2 stop moving significantly, or after a certain pre-determined number of iterations of the process.
We're going to demonstrate the preceding algorithm with graphs and code.
Exercise 2: Implementing k-means Clustering on the Iris Dataset
In this exercise, we will implement k-means clustering step by step:
Load the built-in Iris dataset into the iris_data variable:
iris_data <- iris
Set the color for different species for representation on the scatter plot. This will help us see how the three different species are split between our initial two groupings:
iris_data$t_color <- 'red'
iris_data$t_color[which(iris_data$Species=='setosa')] <- 'green'
iris_data$t_color[which(iris_data$Species=='virginica')] <- 'blue'
Choose any two random cluster centers to start with:
k1 <- c(7,3)
k2 <- c(5,3)
Note
You can try changing the points and see how it affects the final clusters.
Plot the scatter plot along with the centers you chose in the previous step. Pass the sepal length and sepal width of the iris flowers, along with the color, to the plot function in the first line, and then pass the x and y coordinates of each center to the points() function. Here, pch selects the plotting symbol used for the cluster centers; in this case, 4 is a cross and 5 is a diamond:
plot(iris_data$Sepal.Length, iris_data$Sepal.Width, col=iris_data$t_color)
points(k1[1], k1[2], pch=4)
points(k2[1], k2[2], pch=5)
The output is as follows:
Choose the number of iterations you want. The number of iterations should be such that the centers stop changing significantly after each iteration. In our case, six iterations are sufficient:
number_of_steps <- 6
Initialize the variable that will keep track of the number of iterations in the loop:
n <- 1
Start the while loop to find the final cluster centers:
while(n <= number_of_steps) {
Calculate the distance of each point from the current cluster centers, which is Step 2 in the algorithm. We're calculating the Euclidean distance here using the sqrt function:
iris_data$distance_to_clust1 <- sqrt((iris_data$Sepal.Length-k1[1])^2 + (iris_data$Sepal.Width-k1[2])^2)
iris_data$distance_to_clust2 <- sqrt((iris_data$Sepal.Length-k2[1])^2 + (iris_data$Sepal.Width-k2[2])^2)
Assign each point to the cluster whose center it is closest to, which is Step 3 of the algorithm:
iris_data$clust_1 <- 1*(iris_data$distance_to_clust1 <= iris_data$distance_to_clust2)
iris_data$clust_2 <- 1*(iris_data$distance_to_clust1 > iris_data$distance_to_clust2)
Calculate the new cluster centers by calculating the mean x and y coordinates of the points in each cluster (Step 4 in the algorithm) with the mean() function in R:
k1[1] <- mean(iris_data$Sepal.Length[which(iris_data$clust_1==1)])
k1[2] <- mean(iris_data$Sepal.Width[which(iris_data$clust_1==1)])
k2[1] <- mean(iris_data$Sepal.Length[which(iris_data$clust_2==1)])
k2[2] <- mean(iris_data$Sepal.Width[which(iris_data$clust_2==1)])
Update the variable that is keeping count of the iterations, so that the loop stops after the chosen number of iterations, which carries out Step 5 of the algorithm:
n <- n + 1
}
Now we're going to overwrite the species colors with new colors to demonstrate the two clusters. So, there will only be two colors on our next scatter plot – one color for cluster 1, and one color for cluster 2:
iris_data$color <- 'red'
iris_data$color[which(iris_data$clust_2==1)] <- 'blue'
Plot the new scatter plot, which contains clusters along with cluster centers:
plot(iris_data$Sepal.Length, iris_data$Sepal.Width, col=iris_data$color)
points(k1[1], k1[2], pch=4)
points(k2[1], k2[2], pch=5)
The output will be as follows:
Notice how setosa (which used to be green) has been grouped in the left cluster, while most of the virginica flowers (which were blue) have been grouped into the right cluster. The versicolor flowers (which were red) have been split between the two new clusters.
You have successfully implemented the k-means clustering algorithm to identify two groups of flowers based on their sepal size. Notice how the position of centers has changed after running the algorithm.
In the following activity, we are going to increase the number of clusters to three to see whether we can group the flowers correctly into their three different species.
Activity 1: k-means Clustering with Three Clusters
Write an R program to perform k-means clustering on the Iris dataset using three clusters. In this activity, we're going to perform the following steps:
Choose any three random coordinates, k1, k2, and k3, on the plot as centers.
Calculate the distance of each data point from k1, k2, and k3.
Assign each point to the cluster whose center it is closest to.
Find the mean coordinates of all points in the respective clusters and update the values of k1, k2, and k3 to those values.
Start again from Step 2 until the coordinates of k1, k2, and k3 stop moving significantly, or after 10 iterations of the process.
The outcome of this activity will be a chart with three clusters, as follows:
You can compare your chart to Figure 1.10 to see how well the clusters match the actual species classifications.
Note
The solution for this activity can be found on page 198.