Applied Unsupervised Learning with R: Uncover hidden relationships and patterns with k-means clustering, hierarchical clustering, and PCA

Product type Paperback

Published in Mar 2019

ISBN-13 9781789956399

Length 320 pages

Edition 1st Edition

Authors (2):

Bradford Tuckfield

Alok Malik

Chapter 4: Dimension Reduction

Activity 10: Performing PCA and Market Basket Analysis on a New Dataset

Solution:

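Before starting our main analysis, we will load the MASS package, which provides the Boston dataset, and remove one variable that will not be relevant to us (the black column, at position 12):
```
library(MASS)  # provides the Boston dataset
Boston<-Boston[,-12]
```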
Next, we will create dummy variables. Because this adds columns to Boston, we first save a copy of the original, so that we end up with one original dataset and one dummy variable dataset. We do that as follows:
```
Boston_original<-Boston
```
Next, we will create dummy variables for each of the measurements in the original dataset. You can find out the meaning of each of the variables in the dataset in the documentation of the MASS package, available at https://cran.r-project.org/web/packages/MASS/MASS.pdf.

Create dummy variables for whether a town has high or low crime per capita:

```
Boston$highcrim<-1*(Boston$crim>median(Boston$crim))
Boston$lowcrim<-1*(Boston$crim<=median(Boston$crim))
```

Create dummy variables for whether a town has a high or low proportion of residential land zoned for lots over 25,000 square feet:

```
Boston$highzn<-1*(Boston$zn>median(Boston$zn))
Boston$lowzn<-1*(Boston$zn<=median(Boston$zn))
```

Create dummy variables for whether a town has a high or low proportion of non-retail business acres per town:

```
Boston$highindus<-1*(Boston$indus>median(Boston$indus))
Boston$lowindus<-1*(Boston$indus<=median(Boston$indus))
```

Create dummy variables for whether a town borders the Charles River:

```
Boston$highchas<-Boston$chas
Boston$lowchas<-1-Boston$chas
```

Create dummy variables for whether a town has a high or low nitrogen oxide concentration:

```
Boston$highnox<-1*(Boston$nox>median(Boston$nox))
Boston$lownox<-1*(Boston$nox<=median(Boston$nox))
```

Create dummy variables for whether a town has a high or low average number of rooms per dwelling:

```
Boston$highrm<-1*(Boston$rm>median(Boston$rm))
Boston$lowrm<-1*(Boston$rm<=median(Boston$rm))
```

Create dummy variables for whether a town has a high or low proportion of owner-occupied units built prior to 1940:

```
Boston$highage<-1*(Boston$age>median(Boston$age))
Boston$lowage<-1*(Boston$age<=median(Boston$age))
```

Create dummy variables for whether a town has a high or low weighted mean distance to five Boston employment centers:

```
Boston$highdis<-1*(Boston$dis>median(Boston$dis))
Boston$lowdis<-1*(Boston$dis<=median(Boston$dis))
```

Create dummy variables for whether a town has a high or low index of accessibility to radial highways:

```
Boston$highrad<-1*(Boston$rad>median(Boston$rad))
Boston$lowrad<-1*(Boston$rad<=median(Boston$rad))
```

Create dummy variables for whether a town has a high or low full-value property tax rate:

```
Boston$hightax<-1*(Boston$tax>median(Boston$tax))
Boston$lowtax<-1*(Boston$tax<=median(Boston$tax))
```

Create dummy variables for whether a town has a high or low pupil-teacher ratio:

```
Boston$highptratio<-1*(Boston$ptratio>median(Boston$ptratio))
Boston$lowptratio<-1*(Boston$ptratio<=median(Boston$ptratio))
```

Create dummy variables for whether a town has a high or low proportion of lower-status population:

```
Boston$highlstat<-1*(Boston$lstat>median(Boston$lstat))
Boston$lowlstat<-1*(Boston$lstat<=median(Boston$lstat))
```

Create dummy variables for whether a town has a high or low median home value:

```
Boston$highmedv<-1*(Boston$medv>median(Boston$medv))
Boston$lowmedv<-1*(Boston$medv<=median(Boston$medv))
```

Create a dataset that consists entirely of the dummy variables we have just created:
```
Bostondummy<-Boston[,14:ncol(Boston)]
```
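
The high/low pattern repeated above could also be generated in a loop. The following is a minimal sketch rather than the book's method: the toy data frame and the helper name add_median_dummies are made up for illustration, and the same function could be pointed at the Boston columns instead.

```r
# Toy data frame standing in for Boston (values are made up for illustration).
toy <- data.frame(crim = c(0.1, 0.5, 3.2, 8.0),
                  rm   = c(6.5, 5.9, 6.1, 7.2))

# Hypothetical helper: for each named column, add high<name> and low<name>
# dummy columns, split at that column's own median.
add_median_dummies <- function(df, cols = names(df)) {
  for (v in cols) {
    med <- median(df[[v]], na.rm = TRUE)
    df[[paste0("high", v)]] <- 1 * (df[[v]] > med)
    df[[paste0("low", v)]]  <- 1 * (df[[v]] <= med)
  }
  df
}

toy_dummies <- add_median_dummies(toy)
print(toy_dummies)
```

Note that this helper adds all of a column's dummies in column order, and that an already binary variable such as chas would still need the separate treatment shown earlier, since splitting it at its median is not the same as using its values directly.
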
Finally, we will restore our Boston dataset to its original form before all of the dummy variables were added:
```
Boston<-Boston_original
```
Calculate the eigenvalues and eigenvectors of the covariance matrix of the dataset, as follows:
```
Boston_cov<-cov(Boston)
Boston_eigen<-eigen(Boston_cov)
print(Boston_eigen$vectors)
```
The output is as follows:
Figure 4.17: Eigenvectors of the covariance matrix
Print the eigenvalues as follows:
```
print(Boston_eigen$values)
```
The output is as follows:
Figure 4.18: Eigenvalues of the covariance matrix
Next, we create a simple scree plot based on the eigenvalues:
```
plot(Boston_eigen$values,type='o')
```
The output is as follows:
Figure 4.19: Plot of the eigenvalues
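
A scree plot is usually read alongside the cumulative share of variance that the leading eigenvalues account for. The sketch below illustrates that calculation on made-up eigenvalues (toy_values) with a hypothetical 80% target; in the activity, Boston_eigen$values would take the place of toy_values.

```r
# Made-up eigenvalues, sorted in decreasing order as eigen() returns them.
toy_values <- c(8, 4, 2, 2)

# Share of total variance carried by each component, and the running total.
prop_var <- toy_values / sum(toy_values)
cum_var  <- cumsum(prop_var)
print(cum_var)

# Smallest number of components whose cumulative share reaches the target.
target <- 0.8
n_keep <- which(cum_var >= target)[1]
print(n_keep)
```

The target is a judgment call, not a rule: a lower target keeps fewer components at the cost of a rougher restoration.
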
Next, we choose the number of eigenvectors we will use (we chose 10 here), and we transform the dataset to be 10-dimensional, as follows:
```
neigen<-10
transformed<-t(t(as.matrix(Boston_eigen$vectors[,1:neigen])) %*% t(as.matrix(Boston)))
```

Then, we restore the dataset as much as possible:

```
restored<-t(as.matrix(Boston_eigen$vectors[,1:neigen]) %*% t(as.matrix(transformed)))
```

Finally, we can check how close our restoration is to the original dataset; differences near zero indicate a close restoration:
```
print(head(restored-Boston))
```
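
To see how the number of eigenvectors affects the restoration, the same project-and-restore steps can be repeated for several values. This is an illustrative sketch on a made-up three-column dataset, with a hypothetical helper restore_with; it is not part of the original solution.

```r
set.seed(1)
# Made-up dataset: the third column is nearly the sum of the first two.
x1 <- rnorm(50)
x2 <- rnorm(50)
toy <- cbind(x1, x2, x3 = x1 + x2 + rnorm(50, sd = 0.1))

toy_eigen <- eigen(cov(toy))

# Project onto the first n eigenvectors and map back to the original space.
restore_with <- function(n) {
  V <- toy_eigen$vectors[, 1:n, drop = FALSE]
  transformed <- toy %*% V
  transformed %*% t(V)
}

# Mean squared restoration error for 1, 2, and 3 eigenvectors.
errors <- sapply(1:3, function(n) mean((restore_with(n) - toy)^2))
print(errors)
```

The error cannot increase as eigenvectors are added, and it drops to zero (up to rounding) when all of them are used, because the full set of eigenvectors forms an orthogonal basis.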
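We now turn to the market basket analysis of the dummy variable dataset. Here, we need to specify a support threshold (for example, 20%), and complete the first pass through the data, which keeps the items whose support exceeds the threshold: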
```
support_thresh<-0.2
firstpass<-unname(which(colMeans(Bostondummy,na.rm=TRUE)>support_thresh))
```
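
As a small worked example of the first pass, the sketch below computes item supports on a made-up basket matrix. The column names and the 0.3 threshold are chosen only for illustration.

```r
# Made-up basket matrix: rows are towns, columns are 0/1 items.
baskets <- data.frame(highcrim = c(1, 1, 0, 1, 0),
                      hightax  = c(1, 1, 0, 0, 0),
                      highrm   = c(0, 0, 1, 0, 0))

# Support of a single item is the fraction of rows in which it equals 1.
item_support <- colMeans(baskets)
print(item_support)

# Keep the indices of the columns whose support exceeds the threshold.
support_thresh <- 0.3
firstpass <- unname(which(item_support > support_thresh))
print(firstpass)
```

Here the supports are 0.6, 0.4, and 0.2, so only the first two columns survive the first pass.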

Here, we complete the second pass through the data:

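```
# Candidate pairs are all two-item combinations of the first-pass items
secondcand<-t(combn(firstpass,2))
secondpass<-NULL
k<-1
while(k<=nrow(secondcand)){
  # Support of a pair: fraction of rows in which both dummies equal 1
  support<-mean(Bostondummy[,secondcand[k,1]]*Bostondummy[,secondcand[k,2]],na.rm=TRUE)
  if(support>support_thresh){
    secondpass<-rbind(secondpass,secondcand[k,])
  }
  k<-k+1
}
```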
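
The pair supports computed in the loop above can also be obtained for all pairs at once with a matrix product. This is an alternative sketch on a made-up basket matrix, not the book's approach. It does not filter on first-pass items explicitly, which is safe because a pair can never have higher support than either of its items.

```r
# Made-up basket matrix (rows are towns, columns are 0/1 items).
baskets <- cbind(a = c(1, 1, 0, 1, 1),
                 b = c(1, 1, 0, 0, 1),
                 c = c(0, 1, 1, 0, 0))

# Entry [i, j] of crossprod() counts the rows where items i and j are both 1,
# so dividing by the number of rows gives the support of every pair.
pair_support <- crossprod(baskets) / nrow(baskets)
print(pair_support)

# Keep each unordered pair once (upper triangle) if it clears the threshold.
support_thresh <- 0.5
keep <- which(upper.tri(pair_support) & pair_support > support_thresh,
              arr.ind = TRUE)
secondpass <- unname(keep)
print(secondpass)
```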

Here, we complete the third pass, and then calculate the support, confidence, and lift of each three-item basket that passes the support threshold:

```
thirdpass<-NULL
k<-1
while(k<=nrow(secondpass)){
  j<-1
  while(j<=length(firstpass)){
    # Multiply the columns of the current pair: 1 only where both items occur
    n<-1
    product<-1
    while(n<=ncol(secondpass)){
      product<-product*Bostondummy[,secondpass[k,n]]
      n<-n+1
    }
    # Try adding each first-pass item that is not already in the pair
    if(!(firstpass[j] %in% secondpass[k,])){
      product<-product*Bostondummy[,firstpass[j]]
      support<-mean(product,na.rm=TRUE)
      if(support>support_thresh){
        thirdpass<-rbind(thirdpass,c(secondpass[k,],firstpass[j]))
      }
    }
    j<-j+1
  }
  k<-k+1
}

# For each rule {item 1, item 2} -> item 3, record support, confidence, and lift
thirdpass_conf<-NULL
k<-1
while(k<=nrow(thirdpass)){
  support<-mean(Bostondummy[,thirdpass[k,1]]*Bostondummy[,thirdpass[k,2]]*Bostondummy[,thirdpass[k,3]],na.rm=TRUE)
  confidence<-support/mean(Bostondummy[,thirdpass[k,1]]*Bostondummy[,thirdpass[k,2]],na.rm=TRUE)
  lift<-confidence/mean(Bostondummy[,thirdpass[k,3]],na.rm=TRUE)
  thirdpass_conf<-rbind(thirdpass_conf,unname(c(thirdpass[k,],support,confidence,lift)))
  k<-k+1
}
```

Our final output is the list of three-item baskets that have passed the support threshold, together with the support, confidence, and lift of each:
```
print(head(thirdpass_conf))
```
The output is as follows:
Figure 4.20: Output of the three-item basket
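
To make the last three columns of this output concrete, the following sketch works through support, confidence, and lift for a single rule on a made-up basket matrix. The items a, b, and c are hypothetical stand-ins for three dummy variables.

```r
# Made-up basket matrix (rows are towns, columns are 0/1 items).
baskets <- data.frame(a = c(1, 1, 1, 1, 0, 0, 0, 0),
                      b = c(1, 1, 1, 0, 1, 0, 0, 0),
                      c = c(1, 1, 0, 0, 1, 1, 0, 0))

# Rule under test: {a, b} -> c.
support_abc <- mean(baskets$a * baskets$b * baskets$c)  # all three together
support_ab  <- mean(baskets$a * baskets$b)              # antecedent only
support_c   <- mean(baskets$c)                          # consequent only

# Confidence: of the rows containing a and b, the share that also contain c.
confidence <- support_abc / support_ab
# Lift: confidence relative to how common c is overall.
lift <- confidence / support_c
print(c(support = support_abc, confidence = confidence, lift = lift))
```

In this toy case the support is 0.25, the confidence is 2/3, and the lift is 4/3. A lift above 1 suggests that c appears more often alongside a and b than it does in the data as a whole.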