Applied Unsupervised Learning with R

Uncover hidden relationships and patterns with k-means clustering, hierarchical clustering, and PCA

Product type: Paperback
Published: Mar 2019
ISBN-13: 9781789956399
Length: 320 pages
Edition: 1st Edition
Authors: Bradford Tuckfield, Alok Malik

Chapter 2: Advanced Clustering Methods


Activity 5: Implementing k-modes Clustering on the Mushroom Dataset

Solution:

  1. Download mushrooms.csv from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/blob/master/Lesson02/Activity05/mushrooms.csv.

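  2. After downloading, load the mushrooms.csv file in R. Read the text columns as factors so that the summary in a later step reports label counts (since R 4.0, read.csv no longer does this by default):

    ms<-read.csv('mushrooms.csv', stringsAsFactors = TRUE)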
  3. Check the dimensions of the dataset:

    dim(ms)

    The output is as follows:

    [1] 8124   23
  4. Check the distribution of all columns:

    summary(ms)

    The output is as follows:

    Figure 2.29: Screenshot of the summary of distribution of all columns

    For each column, the summary lists the labels it contains and the count of each.

  5. Store all the columns of the dataset except the class label, which is the first column, in a new variable, ms_k:

    ms_k<-ms[,2:23]
  6. Install (if needed) and load the klaR package, which provides the kmodes function:

    install.packages('klaR')  # only needed once
    library(klaR)
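  7. Calculate the k-modes clusters and store them in a kmodes_ms variable. Pass the dataset without the true labels as the first argument and the number of clusters as the second. Because kmodes chooses its initial modes at random, the cluster numbering and the exact counts can differ between runs: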

    kmodes_ms<-kmodes(ms_k,2)
  8. Check the results by creating a table of true labels and cluster labels:

    result = table(ms$class, kmodes_ms$cluster)
    result

    The output is as follows:

           1    2
      e   80 4128
      p 3052  864

As you can see, most of the edible mushrooms (4,128 of 4,208) are in cluster 2 and most of the poisonous mushrooms (3,052 of 3,916) are in cluster 1. So, without ever seeing the class labels, k-modes clustering has done a reasonable job of separating edible mushrooms from poisonous ones.
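
The agreement between clusters and true labels can also be summarized as a single number. The following is a minimal sketch that rebuilds the table from the counts printed in the activity output and computes cluster purity. The variable names (result, majority_counts, purity) are illustrative, and purity is only one possible summary; it is not something the activity itself computes.

```r
# Rebuild the table from the counts shown in the activity output
# (rows: true class, columns: k-modes cluster). matrix() fills column by column.
result <- matrix(c(80, 3052, 4128, 864), nrow = 2,
                 dimnames = list(class = c("e", "p"), cluster = c("1", "2")))

# Purity: give each cluster its most common true class, then measure the
# fraction of mushrooms that match the majority class of their cluster.
majority_counts <- apply(result, 2, max)
purity <- sum(majority_counts) / sum(result)

round(purity, 3)
```

With these counts, 7,180 of the 8,124 mushrooms fall in the majority class of their cluster, so purity is roughly 0.88. A different random start for kmodes may give different counts and therefore a different value.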

Activity 6: Implementing DBSCAN and Visualizing the Results

Solution:

  1. Load the dbscan and factoextra libraries:

    library(dbscan)
    library(factoextra)
  2. Load the multishapes dataset, which ships with factoextra:

    data(multishapes)
  3. Put the first two columns of the multishapes dataset, the x and y coordinates, in the ms variable:

    ms<-multishapes[,1:2]
  4. Plot the dataset as follows:

    plot(ms)

    The output is as follows:

    Figure 2.30: Plot of the multishapes dataset

  5. Perform k-means clustering on the dataset and plot the results:

    km.res<-kmeans(ms,4)
    fviz_cluster(km.res, ms,ellipse = FALSE)

    The output is as follows:

    Figure 2.31: Plot of k-means on the multishapes dataset

  6. Perform DBSCAN on the ms variable and plot the results:

    db.res<-dbscan(ms,eps = .15)
    fviz_cluster(db.res, ms,ellipse = FALSE,geom = 'point')

    The output is as follows:

    Figure 2.32: Plot of DBSCAN on the multishapes dataset

Here, the points plotted in black are noise points: DBSCAN did not assign them to any cluster. The remaining clusters follow the arbitrary shapes and sizes present in the data, which k-means cannot do. Because k-means assigns each point to its nearest centroid, it tends to produce compact, roughly spherical clusters.
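
To make the idea of points that belong to no cluster concrete, here is a minimal base R sketch of how DBSCAN separates core, border, and noise points. It is an illustration of the definitions, not the implementation used by the dbscan package. The toy data and the names (pts, nearby, core, border, noise) are assumptions for this example, and minPts = 5 is assumed because it is the documented default of dbscan::dbscan.

```r
# Toy data (illustrative, not multishapes): a dense 5 x 5 grid plus one distant point.
grid <- expand.grid(x = seq(0, 0.4, by = 0.1), y = seq(0, 0.4, by = 0.1))
pts  <- rbind(grid, data.frame(x = 5, y = 5))

eps    <- 0.15  # neighbourhood radius, same value as in the activity
minPts <- 5     # assumed: the documented default of dbscan::dbscan

# Pairwise Euclidean distances and the eps-neighbourhood of every point
# (each point counts as its own neighbour).
d      <- as.matrix(dist(pts))
nearby <- d <= eps

# Core points have at least minPts points inside their eps-neighbourhood.
core <- rowSums(nearby) >= minPts
# Border points are not core but lie within eps of at least one core point.
border <- !core & rowSums(nearby[, core, drop = FALSE]) > 0
# Everything else is noise: DBSCAN leaves these points out of every cluster.
noise <- !core & !border

which(noise)
```

In this toy example only the distant point is flagged as noise, while the four grid corners end up as border points. In the output of dbscan(), noise points are the ones with cluster label 0.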

Activity 7: Performing a Hierarchical Cluster Analysis on the Seeds Dataset

Solution:

  1. Read the downloaded file into the sd variable:

    sd<-read.delim('seeds_dataset.txt')

    Note

    Make changes to the path as per the location of the file on your system.

  2. Put all the columns of the dataset other than the final label column into the sd_c variable:

    sd_c<-sd[,1:7]
  3. Load the cluster package:

    library(cluster)
  4. Calculate the hierarchical (agglomerative) clusters using average linkage and plot the dendrogram:

    h.res<-hclust(dist(sd_c),"average")
    plot(h.res)

    The output is as follows:

    Figure 2.33: Cluster dendrogram

  5. Cut the tree at k=3 and build a table to see how well the clustering separates the three types of seeds:

    memb <- cutree(h.res, k = 3)
    results<-table(sd$X1,memb)
    results

    The output is as follows:

    Figure 2.34: Table classifying the three types of seeds

  6. Perform divisive clustering on the sd_c dataset and plot the dendrogram:

    d.res<-diana(sd_c,metric ="euclidean")
    plot(d.res)

    The output is as follows:

    Figure 2.35: Dendrogram of divisive clustering

  7. Cut the divisive tree at k=3 and build a table to see how well the clustering separates the three types of seeds. Convert d.res with as.hclust() first, because cutree() expects an hclust-style tree:

    memb <- cutree(as.hclust(d.res), k = 3)
    results<-table(sd$X1,memb)
    results

    The output is as follows:

    Figure 2.36: Table classifying the three types of seeds

Compare this table with the one from the agglomerative tree to see how closely the two methods agree on this dataset. Note that cutting h.res again at this step, rather than d.res, simply reproduces the earlier table. Both methods are hierarchical: agglomerative clustering (hclust) builds the tree bottom-up by repeatedly merging the closest clusters, whereas divisive clustering (diana) works in the reverse direction, starting from a single cluster and repeatedly splitting it.
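
To compare agglomerative and divisive clustering side by side, each tree has to be cut separately and the two sets of memberships cross-tabulated. The following is a minimal sketch on the built-in iris data, which is an assumption made so that the example runs without the seeds file. The names agg, div, memb_agg, and memb_div are illustrative.

```r
library(cluster)  # provides diana(); hclust() and cutree() come with base R (stats)

# Illustrative data: the four numeric columns of the built-in iris dataset
# (an assumption for this sketch; the activity uses the seeds dataset).
x <- iris[, 1:4]

# Agglomerative: start from single points and merge bottom-up (average linkage).
agg <- hclust(dist(x), method = "average")
# Divisive: start from one all-inclusive cluster and split top-down.
div <- diana(x, metric = "euclidean")

# Cut each tree into three clusters; the diana result is converted with as.hclust().
memb_agg <- cutree(agg, k = 3)
memb_div <- cutree(as.hclust(div), k = 3)

# Cross-tabulate the memberships to see where the two methods agree.
table(agglomerative = memb_agg, divisive = memb_div)
```

Cluster numbers are arbitrary labels, so agreement shows up as each row of the table having most of its count in a single column, not necessarily on the diagonal.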
