This is a heuristic algorithm, so it is a good idea to run the process several times. It normally runs quickly in R, as the work in each step is light. The objective is to minimize the total sum of squares by repeatedly refining the cluster assignments.
Predetermining the number of clusters can be problematic. Graphing the data (or a transformation of it, such as its squares) should reveal logical groupings visually. You can also determine a suitable number of clusters by iterating over a range of cluster sizes and looking for a cutoff (we will do that later in this chapter). Several other R packages attempt to compute this number as well. Whichever approach you take, you should verify the fit of the selected clusters upon completion.
Because k-means uses an average (in step 3), it does not work well with fairly sparse data or data with a large number of outliers. Furthermore, there can be a problem if the clusters are not in a nice, linear shape. A graphical representation should show whether your data suits this algorithm.
K-means clustering is performed in R with the `kmeans` function. The R usage of `kmeans` follows the convention given here (note that you can always look up the conventions for a function with the inline help function, for example, `?kmeans`, to get this information):
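The convention, as documented on the `?kmeans` help page in the stats package, looks roughly like the following sketch (the defaults shown are from that documentation):

```r
# Sketch of the kmeans calling convention (see ?kmeans for full details):
# kmeans(x,            - numeric matrix of data points
#        centers,      - number of clusters, or a matrix of initial centers
#        iter.max = 10,  - maximum number of iterations allowed
#        nstart = 1,     - number of random starting sets when centers is a number
#        algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
args(kmeans)   # prints the exact signature in your R version
```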
The various parameters are explained in the following table:
Calling the `kmeans` function returns a `kmeans` object with the following properties:
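A quick way to see those properties is to fit a small model and list the components of the returned object; the component names below come from the `?kmeans` documentation (the seed and toy data here are assumptions for illustration):

```r
set.seed(42)                                     # assumed seed, for reproducibility
fit <- kmeans(matrix(rnorm(200), ncol = 2), centers = 3)
names(fit)
# Components include: "cluster" (assignments), "centers" (cluster means),
# "totss", "withinss", "tot.withinss", "betweenss" (sums of squares),
# "size" (points per cluster), "iter" (iterations used), and "ifault"
</imports>
```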
First, generate a hundred pairs of random numbers from a normal distribution and assign them to the matrix `x` as follows:
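A minimal sketch of that step, assuming a seed and a standard deviation chosen purely for illustration:

```r
set.seed(42)                                          # assumed seed
x <- matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2) # 100 pairs, one per row
dim(x)                                                # 100 rows, 2 columns
```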
We can display the values we generate as follows:
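One way to do this, assuming `x` was generated as above, is to inspect the first few rows in the console and draw a scatter plot:

```r
set.seed(42)
x <- matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2)
head(x)    # first few pairs in the console
plot(x)    # scatter plot of the 100 points
```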
The resultant `kmeans` object values can be determined and displayed (using 10 clusters) as follows:
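A sketch of that call, again with an assumed seed; printing the fitted object shows the cluster means, the clustering vector, the within-cluster sums of squares, and the `between_SS / total_SS` percentage:

```r
set.seed(42)                                          # assumed seed
x <- matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2)
fit <- kmeans(x, centers = 10)                        # 10 clusters
fit                                                   # print the full summary
```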
If we look at the results, we find some interesting data points:
- The `Cluster means` section shows the breakdown of the means used for the cluster assignments.
- The `Clustering vector` shows which cluster each of the 100 points was assigned to.
- The `Cluster sum of squares` shows the `totss` value, as described in the output.
- The percentage value is the `betweenss` value expressed as a percentage of the `totss` value. At 94.6 percent, we have a very good fit.
We chose an arbitrary cluster size of 10, but we should verify that this is a good number to use. If we were to run the `kmeans` function a number of times over a range of cluster sizes, we would end up with a graph that looks like the one in the following example.
For example, if we ran the following code and recorded the results, the output would be as follows:
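The scan can be sketched as below: for each candidate number of clusters, fit `kmeans` and record the total within-cluster sum of squares, then plot the curve. The seed, the upper bound of 15, and the `nstart` value are assumptions for illustration:

```r
set.seed(42)                                          # assumed seed
x <- matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2)
max_k <- 15                                           # assumed scan range
# Total within-cluster sum of squares for each candidate cluster count
wss <- sapply(1:max_k, function(k) kmeans(x, centers = k, nstart = 5)$tot.withinss)
plot(1:max_k, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```

The sum of squares never increases as clusters are added; the point where the curve flattens suggests a reasonable cluster count.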
If the data were more distributed, there would be a clear demarcation at the maximum useful number of clusters, as further clustering would show no improvement in the sum of squares. However, since we used very smooth data for the test, the number of clusters could be allowed to increase.
Once your clusters have been determined, you should be able to produce a visual representation, as shown in the following plot:
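A plot like that can be sketched by coloring each point by its assigned cluster and marking the cluster centers; the seed and data generation are the same illustrative assumptions as before:

```r
set.seed(42)                                          # assumed seed
x <- matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2)
fit <- kmeans(x, centers = 10)
plot(x, col = fit$cluster, pch = 16)                  # color points by cluster
points(fit$centers, col = 1:10, pch = 8, cex = 2)     # mark the 10 centers
```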