Introduction to k-means Clustering with Built-In Functions
In this section, we're going to use built-in R libraries to perform k-means clustering instead of writing custom code, which is lengthy and prone to bugs. Using pre-built libraries instead of writing our own code has other advantages, too:
Library functions are computationally efficient, as thousands of hours have gone into their development.
Library functions are largely bug-free, as they've been tested by thousands of users across almost every practical scenario.
Using libraries saves time, as you don't have to write the code yourself.
k-means Clustering with Three Clusters
In the previous activity, we performed k-means clustering with three clusters by writing our own code. In this section, we're going to achieve a similar result with the help of pre-built R libraries.
We're going to start with the distribution of the three species of flowers in our dataset, as represented in the following graph:
In the preceding plot, setosa is represented in blue, virginica in gray, and versicolor in pink.
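If you'd like to recreate a similar plot yourself, the following is a minimal sketch using base R graphics; the exact colors and styling of the figure in the book may differ:
# Scatter plot of sepal length versus sepal width, colored by species
# (the color choices here simply mirror the description above)
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = c("blue", "pink", "gray")[as.numeric(iris$Species)],
     pch = 19, xlab = "Sepal Length", ylab = "Sepal Width")
legend("topright", legend = levels(iris$Species),
       col = c("blue", "pink", "gray"), pch = 19)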
With this dataset, we're going to perform k-means clustering and see whether the built-in algorithm is able to find a pattern on its own to classify these three species of iris using their sepal sizes. This time, we're going to use just four lines of code.
Exercise 3: k-means Clustering with R Libraries
In this exercise, we're going to learn to do k-means clustering in a much easier way with the pre-built libraries of R. By completing this exercise, you will be able to divide the three species of Iris into three separate clusters:
We put the first two columns of the iris dataset, sepal length and sepal width, in the iris_data variable:
iris_data <- iris[, 1:2]
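If you want to confirm what this subset looks like, you can preview the first few rows; this is an optional check and not part of the exercise steps:
# Show the first six rows of the two-column subset
head(iris_data)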
We find the k-means cluster centers and the cluster to which each point belongs, and store it all in the km.res variable. Here, in the kmeans() function, we enter the dataset as the first parameter and the number of clusters we want as the second parameter:
km.res <- kmeans(iris_data, 3)
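Before plotting, you can inspect the fitted object directly; the km.res list contains, among other things, the cluster centers and the cluster assignment of each point. This inspection is optional and not part of the original steps:
# Coordinates of the three cluster centers
km.res$centers
# Cluster assignment (1, 2, or 3) for each of the 150 observations
km.res$cluster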
Note
The kmeans() function has many input parameters, which can be altered to get different final outputs. You can find out more about them in the documentation at https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/kmeans.
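For example, two commonly adjusted arguments are nstart (the number of random initializations to try) and iter.max (the maximum number of iterations). The values below are purely illustrative and not a recommendation from the exercise:
# Illustrative only: 3 clusters, 25 random starts, up to 100 iterations;
# nstart re-runs the algorithm and keeps the best (lowest within-cluster
# sum of squares) result
km.res <- kmeans(iris_data, centers = 3, nstart = 25, iter.max = 100)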
Install the factoextra library as follows:
install.packages('factoextra')
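If you'd rather not reinstall the package every time you run the script, a common convenience pattern is to install it only when it's missing; this is a sketch, not part of the original exercise:
# Install factoextra only if it isn't already available
if (!requireNamespace("factoextra", quietly = TRUE)) {
  install.packages("factoextra")
}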
We load the factoextra library to visualize the clusters we just created. factoextra is an R package used for plotting multivariate data:
library("factoextra")
Generate the plot of the clusters. Here, we need to enter the results of k-means as the first parameter. In data, we need to enter the data on which clustering was done. In palette, we're selecting the color scheme for the clusters, and in ggtheme, we're selecting the theme of the output plot:
fviz_cluster(km.res, data = iris_data, palette = "jco", ggtheme = theme_minimal())
The output will be as follows:
Here, if you compare Figure 1.18 to Figure 1.17, you will see that we have classified all three species almost correctly. The clusters we've generated don't exactly match the species shown in Figure 1.17, but we've come very close considering that we're only using sepal length and width to classify them.
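One way to check this numerically, rather than by eye, is to cross-tabulate the cluster assignments against the actual species labels; this is a quick sanity check and not part of the original exercise:
# Rows are cluster numbers, columns are the true species;
# off-diagonal counts correspond to points the clustering "misclassified"
table(km.res$cluster, iris$Species)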
You can see from this example that clustering would've been a very useful way of categorizing the irises if we didn't already know their species. You will come across many examples of datasets where you don't have labeled categories, but are able to use clustering to form your own groupings.