Applied Unsupervised Learning with R
Uncover hidden relationships and patterns with k-means clustering, hierarchical clustering, and PCA

Published in Mar 2019 · Paperback · ISBN-13 9781789956399 · 320 pages · 1st Edition
Authors: Alok Malik, Bradford Tuckfield

Introduction to k-means Clustering with Built-In Functions


In this section, we're going to use R's built-in libraries to perform k-means clustering instead of writing custom code, which is lengthy and prone to bugs. Using pre-built libraries instead of writing our own code has other advantages, too:

  • Library functions are computationally efficient, as thousands of hours of development have gone into optimizing them.

  • Library functions are nearly bug-free, as they've been tested by thousands of users across almost every practical scenario.

  • Using libraries saves time, as you don't have to write and debug your own code.

k-means Clustering with Three Clusters

In the previous activity, we performed k-means clustering with three clusters by writing our own code. In this section, we're going to achieve a similar result with the help of pre-built R libraries.

We're going to start with a distribution of three types of flowers in our dataset, as represented in the following graph:

Figure 1.17: A graph representing three species of iris in three colors

In the preceding plot, setosa is represented in blue, virginica in gray, and versicolor in pink.

With this dataset, we're going to perform k-means clustering and see whether the built-in algorithm can find a pattern on its own that separates these three species of iris using only their sepal measurements. This time, we're going to use just four lines of code.
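As a quick aside, a plot like Figure 1.17 can be reproduced with base R graphics. This is a sketch: the column names come from the standard iris dataset, and the color mapping follows the book's description (setosa in blue, versicolor in pink, virginica in gray), though the exact shades in the figure may differ:

```r
# Map each species to a color; levels(iris$Species) is
# c("setosa", "versicolor", "virginica"), in that order
species_colors <- c("blue", "pink", "gray")

# Scatter plot of sepal length vs. sepal width, colored by species
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = species_colors[as.numeric(iris$Species)],
     pch = 19, xlab = "Sepal Length", ylab = "Sepal Width")
legend("topright", legend = levels(iris$Species),
       col = species_colors, pch = 19)
```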

Exercise 3: k-means Clustering with R Libraries

In this exercise, we're going to learn to do k-means clustering in a much easier way with the pre-built libraries of R. By completing this exercise, you will be able to divide the three species of Iris into three separate clusters:

  1. We put the first two columns of the iris dataset, sepal length and sepal width, into the iris_data variable:

    iris_data <- iris[, 1:2]
  2. We find the k-means cluster centers and the cluster to which each point belongs, and store them in the km.res variable. Here, in the kmeans function, we enter the dataset as the first parameter and the number of clusters we want as the second parameter:

    km.res <- kmeans(iris_data, 3)

    Note

    The kmeans function has many input parameters, which can be altered to produce different final outputs. You can find out more about them in the documentation at https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/kmeans.
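For instance, two documented kmeans arguments worth knowing are nstart, which controls how many random starting configurations are tried (the best result is kept), and iter.max, which caps the number of iterations. A sketch:

```r
# Same first-two-columns subset of iris as in the exercise
iris_data <- iris[, 1:2]

# Reproducible run: 25 random starts, up to 100 iterations per start.
# More starts make the result less sensitive to random initialization.
set.seed(123)
km.res <- kmeans(iris_data, centers = 3, nstart = 25, iter.max = 100)

km.res$size     # number of points assigned to each of the 3 clusters
```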

  3. Install the factoextra library as follows:

    install.packages('factoextra')
  4. We import the factoextra library to visualize the clusters we just created. factoextra is an R package used for plotting multivariate data:

    library("factoextra") 
  5. Generate the plot of the clusters. Here, we pass the k-means result as the first parameter. In data, we pass the data on which clustering was performed. palette selects the color palette for the clusters, and ggtheme selects the theme of the output plot:

    fviz_cluster(km.res, data = iris_data, palette = "jco", ggtheme = theme_minimal())

    The output will be as follows:

    Figure 1.18: Three species of Iris have been clustered into three clusters

Here, if you compare Figure 1.18 to Figure 1.17, you will see that we have classified all three species almost correctly. The clusters we've generated don't exactly match the species shown in Figure 1.17, but we've come very close considering that we only used sepal length and width to classify them.
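One way to check the "almost correctly" claim numerically (a sketch, not part of the original exercise) is to cross-tabulate the cluster assignments against the known species labels. A mostly diagonal-looking table means the clusters recover the species well; note that cluster numbers are arbitrary, so the rows may appear in any order:

```r
iris_data <- iris[, 1:2]

set.seed(123)  # cluster labels depend on random initialization
km.res <- kmeans(iris_data, 3)

# Rows are clusters, columns are true species; each cell counts
# how many flowers of that species landed in that cluster
tab <- table(Cluster = km.res$cluster, Species = iris$Species)
print(tab)
```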

You can see from this example that clustering would've been a very useful way of categorizing the irises if we didn't already know their species. You will come across many examples of datasets where you don't have labeled categories, but are able to use clustering to form your own groupings.
