Chapter 6: Anomaly Detection
Activity 14: Finding Univariate Anomalies Using a Parametric Method and a Non-parametric Method
Solution:
Load the data as follows:
data(islands)
Draw a boxplot as follows:
boxplot(islands)
You should notice that the data is extremely fat-tailed (heavily right-skewed): the box spanning the interquartile range, with the median inside it, takes up a tiny portion of the plot compared to the many observations that R has drawn as outliers.
Create a new log-transformed dataset as follows:
log_islands<-log(islands)
Create a boxplot of the log-transformed data as follows:
boxplot(log_islands)
You should notice that there are only five outliers after the log transformation.
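To confirm the outlier counts numerically rather than by eye, one option is base R's `boxplot.stats()`, which returns the points a boxplot would draw beyond its whiskers in the `out` element. This is a minimal sketch, not part of the original solution; the names `raw_out` and `log_out` are illustrative.

```r
# Count the points that boxplot() would flag, before and after the log transform.
data(islands)

raw_out <- boxplot.stats(islands)$out       # flagged on the original scale
log_out <- boxplot.stats(log(islands))$out  # flagged on the log scale

length(raw_out)
length(log_out)
```

Note that `boxplot()` places the box edges at hinges (computed by `fivenum()`), which can differ slightly from the values returned by `quantile()`.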
Calculate the interquartile range:
interquartile_range<-quantile(islands,.75)-quantile(islands,.25)
Add 1.5 times the interquartile range to the third quartile to get the upper limit of the non-outlier data:
upper_limit<-quantile(islands,.75)+1.5*interquartile_range
Classify outliers as any observations above this upper limit:
outliers<-islands[which(islands>upper_limit)]
Calculate the interquartile range for the log-transformed data:
interquartile_range_log<-quantile(log_islands,.75)-quantile(log_islands,.25)
Add 1.5 times the interquartile range to the third quartile to get the upper limit of the non-outlier data:
upper_limit_log<-quantile(log_islands,.75)+1.5*interquartile_range_log
Classify outliers as any observations above this upper limit:
outliers_log<-islands[which(log_islands>upper_limit_log)]
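The interquartile-range steps above can be wrapped in a small reusable function. The sketch below is an assumption-laden illustration, not the book's code: `iqr_outliers` and its argument `k` are hypothetical names, and unlike the steps above it also checks a lower fence, which makes no difference for `islands` because no observation falls below it.

```r
# Hypothetical helper: return the elements of x outside the IQR fences.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # first and third quartiles
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr                         # lower fence
  upper <- q[2] + k * iqr                         # upper fence
  x[x < lower | x > upper]
}

data(islands)
iqr_outliers(islands)        # outliers on the original scale
iqr_outliers(log(islands))   # outliers on the log scale (values are logged)
```

To report the log-scale outliers in the original units, as the solution does, index `islands` with the names of the second result: `islands[names(iqr_outliers(log(islands)))]`.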
Print the non-transformed outliers as follows:
print(outliers)
For the non-transformed outliers, we obtain the following:
Print the log-transformed outliers as follows:
print(outliers_log)
For the log-transformed outliers, we obtain the following:
Turning to the parametric method, calculate the mean and standard deviation of the data:
island_mean<-mean(islands)
island_sd<-sd(islands)
Select observations that are more than two standard deviations above the mean:
outliers<-islands[which(islands>(island_mean+2*island_sd))]
outliers
We obtain the following outliers:
Next, calculate the mean and standard deviation of the log-transformed data:
island_mean_log<-mean(log_islands)
island_sd_log<-sd(log_islands)
Select observations that are more than two standard deviations above the mean:
outliers_log<-log_islands[which(log_islands>(island_mean_log+2*island_sd_log))]
Print the log-transformed outliers as follows:
print(outliers_log)
The output is as follows:
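The mean-and-standard-deviation rule used above can also be written as a small function based on z-scores. This is a hedged sketch rather than the book's method: `sd_outliers` and `n_sd` are hypothetical names, and the function is two-sided (it uses the absolute z-score), whereas the steps above only look above the mean.

```r
# Hypothetical helper: return elements of x more than n_sd standard
# deviations away from the mean, in either direction.
sd_outliers <- function(x, n_sd = 2) {
  z <- (x - mean(x)) / sd(x)  # z-score of each observation
  x[abs(z) > n_sd]
}

data(islands)
sd_outliers(islands)        # original scale
sd_outliers(log(islands))   # log scale
```

Because this rule relies on the mean and standard deviation, it is itself sensitive to extreme values, which is one reason to compare it against the interquartile-range method.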
Activity 15: Using Mahalanobis Distance to Find Anomalies
Solution:
You can load and plot the data as follows:
data(cars)
plot(cars)
The output plot is the following:
Calculate the centroid:
centroid<-c(mean(cars$speed),mean(cars$dist))
Calculate the covariance matrix:
cov_mat<-cov(cars)
Calculate the inverse of the covariance matrix:
inv_cov_mat<-solve(cov_mat)
Create a NULL variable, which will hold each of our calculated distances:
all_distances<-NULL
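We can loop through each observation and find the Mahalanobis distance between it and the centroid of the data. Note that the quadratic form computed here is the squared Mahalanobis distance; because squaring preserves the ordering, it ranks the observations identically: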
k<-1
while(k<=nrow(cars)){
  the_distance<-cars[k,]-centroid
  mahalanobis_dist<-t(matrix(as.numeric(the_distance)))%*%
    matrix(inv_cov_mat,nrow=2)%*%
    matrix(as.numeric(the_distance))
  all_distances<-c(all_distances,mahalanobis_dist)
  k<-k+1
}
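As a cross-check on the loop, base R's `stats` package provides `mahalanobis()`, which computes the same squared distances for every row in one call. The sketch below is an alternative to the loop, not the book's solution; the name `d2` is illustrative.

```r
# Squared Mahalanobis distance of each row of cars from the centroid,
# using the built-in function instead of an explicit loop.
data(cars)

d2 <- mahalanobis(cars, center = colMeans(cars), cov = cov(cars))

head(d2)
```

A quick sanity check is that, when the sample covariance matrix is used, the squared distances sum to (n - 1) times the number of variables, which is 49 * 2 = 98 for `cars`.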
Plot all observations whose Mahalanobis distance exceeds the 90th percentile of all distances to see our outliers:
plot(cars)
points(cars$speed[which(all_distances>quantile(all_distances,.9))],
       cars$dist[which(all_distances>quantile(all_distances,.9))],
       col='red',pch=19)
We can see the output plot as follows, with the outlier points shown in red:
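To inspect the flagged observations as data rather than only on the plot, the 90th-percentile cutoff can be computed once and reused. This is a minimal sketch under the same threshold as above; the names `d2`, `cutoff`, and `flagged` are illustrative, and the distances are recomputed with the built-in `mahalanobis()` so that the snippet runs on its own.

```r
# List the rows of cars whose squared Mahalanobis distance exceeds
# the 90th percentile of all distances.
data(cars)

d2 <- mahalanobis(cars, center = colMeans(cars), cov = cov(cars))
cutoff <- quantile(d2, 0.9)      # same threshold as in the plot
flagged <- cars[d2 > cutoff, ]   # rows drawn in red

flagged
```

The 0.9 quantile is an arbitrary choice: it always flags roughly the top 10% of observations, whether or not they are genuinely anomalous, so the cutoff should be adjusted to the problem at hand.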