Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Applied Unsupervised Learning with R

You're reading from   Applied Unsupervised Learning with R Uncover hidden relationships and patterns with k-means clustering, hierarchical clustering, and PCA

Arrow left icon
Product type Paperback
Published in Mar 2019
Publisher
ISBN-13 9781789956399
Length 320 pages
Edition 1st Edition
Languages
Arrow right icon
Authors (2):
Arrow left icon
Bradford Tuckfield Bradford Tuckfield
Author Profile Icon Bradford Tuckfield
Bradford Tuckfield
Alok Malik Alok Malik
Author Profile Icon Alok Malik
Alok Malik
Arrow right icon
View More author details
Toc

Chapter 4: Dimension Reduction


Activity 10: Performing PCA and Market Basket Analysis on a New Dataset

Solution:

  1. Before starting our main analysis, we will remove one variable that will not be relevant to us:

    Boston<-Boston[,-12]
  2. We will create dummy variables. We will end up with one original dataset, and one dummy variable dataset. We do that as follows:

    Boston_original<-Boston

    Next, we will create dummy variables for each of the measurements in the original dataset. You can find out the meaning of each of the variables in the dataset in the documentation of the MASS package, available at https://cran.r-project.org/web/packages/MASS/MASS.pdf.

  3. Create dummy variables for whether a town has high or low crime per capita:

    Boston$highcrim<-1*(Boston$indus>median(Boston$crim))
    Boston$lowcrim<-1*(Boston$indus<=median(Boston$crim))

    Create dummy variables for whether a town has a high or low proportion of land zoned for lots over 25,000 feet:

    Boston$highzn<-1*(Boston$zn>median(Boston$zn))
    Boston$lowzn<-1*(Boston$zn<=median(Boston$zn))

    Create dummy variables for whether a town has a high or low proportion of non-retail business acres per town:

    Boston$highindus<-1*(Boston$indus>median(Boston$indus))
    Boston$lowindus<-1*(Boston$indus<=median(Boston$indus))

    Create dummy variables for whether a town borders the Charles River:

    Boston$highchas<-(Boston$chas)
    Boston$lowchas<-(1-Boston$chas)

    Create dummy variables for whether a town has a high or low nitrogen oxide concentration:

    Boston$highnox<-1*(Boston$nox>median(Boston$nox))
    Boston$lownox<-1*(Boston$nox<=median(Boston$nox))

    Create dummy variables for whether a town has a high or low average number of rooms per dwelling:

    Boston$highrm<-1*(Boston$rm>median(Boston$rm))
    Boston$lowrm<-1*(Boston$rm<=median(Boston$rm))

    Create dummy variables for whether a town has a high or low proportion of owner-occupied units built prior to 1940:

    Boston$highage<-1*(Boston$age>median(Boston$age))
    Boston$lowage<-1*(Boston$age<=median(Boston$age))

    Create dummy variables for whether a town has a high or low average distance to five of Boston's employment centers:

    Boston$highdis<-1*(Boston$dis>median(Boston$dis))
    Boston$lowdis<-1*(Boston$dis<=median(Boston$dis))

    Create dummy variables for whether a town has a high or low index of accessibility to radial highways:

    Boston$highrad<-1*(Boston$rad>median(Boston$rad))
    Boston$lowrad<-1*(Boston$rad<=median(Boston$rad))

    Create dummy variables for whether a town has a high or low full-value property tax rate:

    Boston$hightax<-1*(Boston$tax>median(Boston$tax))
    Boston$lowtax<-1*(Boston$tax<=median(Boston$tax))

    Create dummy variables for whether a town has a high or low pupil-teacher ratio:

    Boston$highptratio<-1*(Boston$ptratio>median(Boston$ptratio))
    Boston$lowptratio<-1*(Boston$ptratio<=median(Boston$ptratio))

    Create dummy variables for whether a town has a high or low proportion of lower-status population:

    Boston$highlstat<-1*(Boston$lstat>median(Boston$lstat))
    Boston$lowlstat<-1*(Boston$lstat<=median(Boston$lstat))

    Create dummy variables for whether a town has a high or low median home value:

    Boston$highmedv<-1*(Boston$medv>median(Boston$medv))
    Boston$lowmedv<-1*(Boston$medv<=median(Boston$medv))
  4. Create a dataset that consists entirely of the dummy variables we have just created:

    Bostondummy<-Boston[,14:ncol(Boston)]
  5. Finally, we will restore our Boston_2 dataset to its original form before all of the dummy variables were added:

    Boston<-Boston_original
  6. Calculate the eigenvalues and eigenvectors of the covariance matrix of the dataset, as follows:

    Boston_cov<-cov(Boston)
    Boston_eigen<-eigen(Boston_cov)
    print(Boston_eigen$vectors)

    The output is as follows:

    Figure 4.17: Eigenvectors of the covariance matrix

  7. Print eigen values as follows:

    print(Boston_eigen$values)

    The output is as follows:

    Figure 4.18: Eigenvalues of the covariance matrix

  8. For the third part, we create a simple scree plot based on the eigenvalues:

    plot(Boston_eigen$values,type='o')

    The output is as follows:

    Figure 4.19: Plot of the eigenvalues

  9. Next, we choose the number of eigenvectors we will use (I chose 10), and we transform the dataset to be 10-dimensional, as follows:

    neigen<-10
    transformed<-t(t(as.matrix(Boston_eigen$vectors[,1:neigen])) %*% t(as.matrix(Boston)))
  10. Then, we restore the dataset as much as possible:

    restored<- t(as.matrix(Boston_eigen$vectors[,1:neigen]) %*% t(as.matrix(transformed)))
  11. Finally, we can check how close our restoration is to the original dataset, as follows:

    print(head(restored-Boston))
  12. Here, we need to specify a support threshold (for example, 20%), and complete the first pass through the data:

    support_thresh<-0.2
    firstpass<-unname(which(colMeans(Bostondummy,na.rm=TRUE)>support_thresh))
  13. Here, we complete the second pass through the data:

    secondcand<-t(combn(firstpass,2))
    secondpass<-NULL
    k<-1
    while(k<=nrow(secondcand)){
    support<-mean(Bostondummy[,secondcand[k,1]]*Bostondummy[,secondcand[k,2]],na.rm=TRUE)
    if(support>support_thresh){
    secondpass<-rbind(secondpass,secondcand[k,])
    }
    k<-k+1
    }
  14. Here, we complete the third pass, and then do filtering based on the confidence and lift thresholds:

    thirdpass<-NULL
    k<-1
    while(k<=nrow(secondpass)){
    j<-1
    while(j<=length(firstpass)){
    n<-1
    product<-1
    while(n<=ncol(secondpass)){
    product<-product*Bostondummy[,secondpass[k,n]]
    n<-n+1
    }
    if(!(firstpass[j] %in% secondpass[k,])){
    product<-product*Bostondummy[,firstpass[j]]
    support<-mean(product,na.rm=TRUE)
    if(support>support_thresh){
    thirdpass<-rbind(thirdpass,c(secondpass[k,],firstpass[j]))
    }
    }
    j<-j+1
    }
    k<-k+1
    }
    
    
    thirdpass_conf<-NULL
    k<-1
    while(k<=nrow(thirdpass)){
    
    support<-mean(Bostondummy[,thirdpass[k,1]]*Bostondummy[,thirdpass[k,2]]*Bostondummy[,thirdpass[k,3]],na.rm=TRUE)
    confidence<-mean(Bostondummy[,thirdpass[k,1]]*Bostondummy[,thirdpass[k,2]]*Bostondummy[,thirdpass[k,3]],na.rm=TRUE)/mean(Bostondummy[,thirdpass[k,1]]*Bostondummy[,thirdpass[k,2]],na.rm=TRUE)
    
    lift<-confidence/mean(Bostondummy[,thirdpass[k,3]],na.rm=TRUE)
    
    thirdpass_conf<-rbind(thirdpass_conf,unname(c(thirdpass[k,],support,confidence,lift)))
    k<-k+1
    }
  15. Our final output is the list of three-item baskets that have passed the support, confidence, and lift thresholds:

    print(head(thirdpass_conf))

    The output is as follows:

    Figure 4.20: Output of the three-item basket

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime