Applied Unsupervised Learning with R

You're reading from Applied Unsupervised Learning with R: Uncover hidden relationships and patterns with k-means clustering, hierarchical clustering, and PCA

Product type Paperback
Published in Mar 2019
Publisher
ISBN-13 9781789956399
Length 320 pages
Edition 1st Edition
Languages
Authors (2):
Bradford Tuckfield
Alok Malik

Chapter 6: Anomaly Detection


Activity 14: Finding Univariate Anomalies Using a Parametric Method and a Non-parametric Method

Solution:

  1. Load the data as follows:

    data(islands)
  2. Draw a boxplot as follows:

    boxplot(islands)

    Figure 6.21: Boxplot of the islands dataset

    You should notice that the data is extremely fat-tailed: the box spanning the interquartile range takes up a tiny portion of the plot compared with the many observations that R has drawn as individual outlier points.
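
To see exactly which observations the boxplot flags, one option is `boxplot.stats()`, the base R function that `boxplot()` relies on for its statistics. The following is a minimal sketch; the variable names `stats`, `flagged`, `median_area`, and `hinge_iqr` are illustrative and do not appear in the original solution:

```r
data(islands)

# boxplot.stats() returns the five-number summary ($stats) and the
# points that boxplot() would draw individually as outliers ($out)
stats <- boxplot.stats(islands)
flagged <- stats$out

# Median and hinge-based interquartile range, for comparison
# with the largest value in the data
median_area <- stats$stats[3]
hinge_iqr <- stats$stats[4] - stats$stats[2]

print(sort(flagged, decreasing = TRUE))
print(c(median = median_area, iqr = hinge_iqr, max = max(islands)))
```

Note that `boxplot.stats()` uses hinges rather than `quantile()`, so its limits can differ slightly from limits computed by hand from quartiles.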

  1. Create a new log-transformed dataset as follows:

    log_islands<-log(islands)
  2. Create a boxplot of the log-transformed data as follows:

    boxplot(log_islands)

    Figure 6.22: Boxplot of log-transformed dataset

    You should notice that there are only five outliers after the log transformation.

  3. Calculate the interquartile range:

    interquartile_range<-quantile(islands,.75)-quantile(islands,.25)
  4. Add 1.5 times the interquartile range to the third quartile to get the upper limit of the non-outlier data:

    upper_limit<-quantile(islands,.75)+1.5*interquartile_range
  5. Classify outliers as any observations above this upper limit:

    outliers<-islands[which(islands>upper_limit)]
  6. Calculate the interquartile range for the log-transformed data:

    interquartile_range_log<-quantile(log_islands,.75)-quantile(log_islands,.25)
  7. Add 1.5 times the interquartile range to the third quartile to get the upper limit of the non-outlier data:

    upper_limit_log<-quantile(log_islands,.75)+1.5*interquartile_range_log
  8. Classify outliers as any observations above this upper limit:

    outliers_log<-islands[which(log_islands>upper_limit_log)]
  9. Print the non-transformed outliers as follows:

    print(outliers)

    For the non-transformed outliers, we obtain the following:

    Figure 6.23: Non-transformed outliers

    Print the log-transformed outliers as follows:

    print(outliers_log)

    For the log-transformed outliers, we obtain the following:

    Figure 6.24: Log-transformed outliers

  10. Calculate the mean and standard deviation of the data:

    island_mean<-mean(islands)
    island_sd<-sd(islands)
  11. Select observations that are more than two standard deviations above the mean:

    outliers<-islands[which(islands>(island_mean+2*island_sd))]
    outliers

    We obtain the following outliers:

    Figure 6.25: Screenshot of the outliers

  12. Calculate the mean and standard deviation of the log-transformed data:

    island_mean_log<-mean(log_islands)
    island_sd_log<-sd(log_islands)
  13. Select log-transformed observations that are more than two standard deviations above the mean:

    outliers_log<-log_islands[which(log_islands>(island_mean_log+2*island_sd_log))]
  14. Print the log-transformed outliers as follows:

    print(outliers_log)

    The output is as follows:

    Figure 6.26: Log-transformed outliers
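
The two rules used in this activity can be wrapped in small helper functions to compare them side by side. This is a minimal sketch, assuming the same one-sided (upper-tail) checks as the steps above; the function names `iqr_upper_outliers` and `sd_upper_outliers` and their parameters `k` and `n_sd` are illustrative and do not appear in the original solution:

```r
data(islands)
log_islands <- log(islands)

# Non-parametric rule: flag values above Q3 + k * IQR
iqr_upper_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(.25, .75), names = FALSE)
  x[x > q[2] + k * (q[2] - q[1])]
}

# Parametric rule: flag values more than n_sd standard
# deviations above the mean
sd_upper_outliers <- function(x, n_sd = 2) {
  x[x > mean(x) + n_sd * sd(x)]
}

# Number of observations flagged by each rule on each scale
counts <- c(
  iqr_raw = length(iqr_upper_outliers(islands)),
  iqr_log = length(iqr_upper_outliers(log_islands)),
  sd_raw  = length(sd_upper_outliers(islands)),
  sd_log  = length(sd_upper_outliers(log_islands))
)
print(counts)
```

On strongly skewed data such as `islands`, the standard-deviation rule tends to flag fewer raw observations than the interquartile-range rule, because the extreme values themselves inflate the standard deviation.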

Activity 15: Using Mahalanobis Distance to Find Anomalies

Solution:

  1. You can load and plot the data as follows:

    data(cars)
    plot(cars)

    The output plot is the following:

    Figure 6.27: Plot of the cars dataset

  2. Calculate the centroid:

    centroid<-c(mean(cars$speed),mean(cars$dist))
  3. Calculate the covariance matrix:

    cov_mat<-cov(cars)
  4. Calculate the inverse of the covariance matrix:

    inv_cov_mat<-solve(cov_mat)
  5. Create a NULL variable, which will hold each of our calculated distances:

    all_distances<-NULL
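  6. Loop through each observation and compute its squared Mahalanobis distance from the centroid of the data. The square root is not taken here; this does not affect the next step, which depends only on the ranking of the distances:

    k<-1
    while(k<=nrow(cars)){
      # difference between observation k and the centroid
      the_distance<-cars[k,]-centroid
      # squared Mahalanobis distance: t(d) %*% inverse covariance %*% d
      mahalanobis_dist<-t(matrix(as.numeric(the_distance)))%*% matrix(inv_cov_mat,nrow=2) %*% matrix(as.numeric(the_distance))
      all_distances<-c(all_distances,mahalanobis_dist)
      k<-k+1
    }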
  7. Plot the observations whose distances exceed the 90th percentile of all distances to see our outliers:

    plot(cars)
    is_outlier<-all_distances>quantile(all_distances,.9)
    points(cars$speed[is_outlier],cars$dist[is_outlier],col='red',pch=19)

    We can see the output plot as follows, with the outlier points shown in red:

    Figure 6.28: Plot with outliers marked
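
As a cross-check on the loop in this activity, base R provides `mahalanobis()`, which returns squared distances for every row in one call. The sketch below is an alternative to the manual loop and is not part of the original solution; the names `d2`, `diff_1`, `manual_1`, and `outlier_rows` are illustrative:

```r
data(cars)

# Built-in equivalent: mahalanobis() returns *squared* distances
# from each row of x to the given center
d2 <- mahalanobis(cars, center = colMeans(cars), cov = cov(cars))

# Manual computation for the first row, mirroring the loop in the solution
diff_1 <- as.numeric(cars[1, ]) - colMeans(cars)
manual_1 <- as.numeric(t(diff_1) %*% solve(cov(cars)) %*% diff_1)

# The two approaches should agree up to floating-point error
print(all.equal(unname(d2[1]), manual_1))

# Flag rows above the 90th percentile of distance, as in step 7
outlier_rows <- which(d2 > quantile(d2, .9))
print(cars[outlier_rows, ])
```

Using the vectorized function avoids growing `all_distances` inside a loop, which matters little for the 50 rows of `cars` but can become slow on larger datasets.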
