Chapter 4: Dimension Reduction
Activity 10: Performing PCA and Market Basket Analysis on a New Dataset
Solution:
Before starting our main analysis, we will remove one variable that will not be relevant to us:
library(MASS) # provides the Boston dataset
Boston<-Boston[,-12] # drop column 12 (black)
We will create dummy variables, ending up with two datasets: the original and a dummy-variable version. First, we save a copy of the original:
Boston_original<-Boston
Next, we will create dummy variables for each of the measurements in the original dataset. You can find out the meaning of each of the variables in the dataset in the documentation of the MASS package, available at https://cran.r-project.org/web/packages/MASS/MASS.pdf.
Create dummy variables for whether a town has high or low crime per capita:
Boston$highcrim<-1*(Boston$crim>median(Boston$crim))
Boston$lowcrim<-1*(Boston$crim<=median(Boston$crim))
Create dummy variables for whether a town has a high or low proportion of residential land zoned for lots over 25,000 square feet:
Boston$highzn<-1*(Boston$zn>median(Boston$zn))
Boston$lowzn<-1*(Boston$zn<=median(Boston$zn))
Create dummy variables for whether a town has a high or low proportion of non-retail business acres per town:
Boston$highindus<-1*(Boston$indus>median(Boston$indus))
Boston$lowindus<-1*(Boston$indus<=median(Boston$indus))
Create dummy variables for whether a town borders the Charles River:
Boston$highchas<-(Boston$chas)
Boston$lowchas<-(1-Boston$chas)
Create dummy variables for whether a town has a high or low nitrogen oxide concentration:
Boston$highnox<-1*(Boston$nox>median(Boston$nox))
Boston$lownox<-1*(Boston$nox<=median(Boston$nox))
Create dummy variables for whether a town has a high or low average number of rooms per dwelling:
Boston$highrm<-1*(Boston$rm>median(Boston$rm))
Boston$lowrm<-1*(Boston$rm<=median(Boston$rm))
Create dummy variables for whether a town has a high or low proportion of owner-occupied units built prior to 1940:
Boston$highage<-1*(Boston$age>median(Boston$age))
Boston$lowage<-1*(Boston$age<=median(Boston$age))
Create dummy variables for whether a town has a high or low average distance to five of Boston's employment centers:
Boston$highdis<-1*(Boston$dis>median(Boston$dis))
Boston$lowdis<-1*(Boston$dis<=median(Boston$dis))
Create dummy variables for whether a town has a high or low index of accessibility to radial highways:
Boston$highrad<-1*(Boston$rad>median(Boston$rad))
Boston$lowrad<-1*(Boston$rad<=median(Boston$rad))
Create dummy variables for whether a town has a high or low full-value property tax rate:
Boston$hightax<-1*(Boston$tax>median(Boston$tax))
Boston$lowtax<-1*(Boston$tax<=median(Boston$tax))
Create dummy variables for whether a town has a high or low pupil-teacher ratio:
Boston$highptratio<-1*(Boston$ptratio>median(Boston$ptratio))
Boston$lowptratio<-1*(Boston$ptratio<=median(Boston$ptratio))
Create dummy variables for whether a town has a high or low proportion of lower-status population:
Boston$highlstat<-1*(Boston$lstat>median(Boston$lstat))
Boston$lowlstat<-1*(Boston$lstat<=median(Boston$lstat))
Create dummy variables for whether a town has a high or low median home value:
Boston$highmedv<-1*(Boston$medv>median(Boston$medv))
Boston$lowmedv<-1*(Boston$medv<=median(Boston$medv))
Create a dataset that consists entirely of the dummy variables we have just created:
Bostondummy<-Boston[,14:ncol(Boston)]
Finally, we will restore our Boston dataset to its original form, as it was before the dummy variables were added:
Boston<-Boston_original
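The dummy-variable steps above all follow the same pattern: split each column at its median into a high flag and a low flag. As a rough sketch of that pattern, the following loop builds the flags for every column of a small made-up data frame. The helper name make_dummies and the toy data are assumptions for illustration and are not part of the original solution.

```r
# Toy data frame standing in for Boston (values are made up for illustration)
toy <- data.frame(a = c(1, 5, 3, 9), b = c(10, 20, 30, 40))

# Hypothetical helper: build high<name>/low<name> flags split at each column's median
make_dummies <- function(df) {
  out <- list()
  for (v in names(df)) {
    out[[paste0("high", v)]] <- 1 * (df[[v]] > median(df[[v]]))
    out[[paste0("low", v)]]  <- 1 * (df[[v]] <= median(df[[v]]))
  }
  as.data.frame(out)
}

toy_dummy <- make_dummies(toy)
print(toy_dummy)
```

Because every row is either above the median or not, each high/low pair sums to 1 in every row, which is a quick way to check the flags.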
Calculate the eigenvalues and eigenvectors of the covariance matrix of the dataset, as follows:
Boston_cov<-cov(Boston)
Boston_eigen<-eigen(Boston_cov)
print(Boston_eigen$vectors)
The output is as follows:
Print the eigenvalues as follows:
print(Boston_eigen$values)
The output is as follows:
For the third part, we create a simple scree plot based on the eigenvalues:
plot(Boston_eigen$values,type='o')
The output is as follows:
Next, we choose the number of eigenvectors we will use (I chose 10), and we transform the dataset to be 10-dimensional, as follows:
neigen<-10
transformed<-t(t(as.matrix(Boston_eigen$vectors[,1:neigen])) %*% t(as.matrix(Boston)))
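The number of eigenvectors to keep is a judgment call. One common aid is the cumulative share of total variance carried by the leading eigenvalues. The sketch below shows the calculation on made-up data. The 95% target is an assumption chosen for illustration, not a value from the text.

```r
# Toy data: three columns, two of them strongly correlated (made-up values)
set.seed(1)
x <- rnorm(50)
toy <- cbind(x1 = x, x2 = 2 * x + rnorm(50, sd = 0.1), x3 = rnorm(50))

toy_eigen <- eigen(cov(toy))

# Share of total variance carried by each component, and the running total
var_share <- toy_eigen$values / sum(toy_eigen$values)
cum_share <- cumsum(var_share)
print(round(cum_share, 3))

# Smallest number of components that reaches the assumed 95% target
n_keep <- which(cum_share >= 0.95)[1]
print(n_keep)
```

The same two lines applied to Boston_eigen$values would give the cumulative share for the Boston data.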
Then, we restore the dataset as much as possible:
restored<- t(as.matrix(Boston_eigen$vectors[,1:neigen]) %*% t(as.matrix(transformed)))
Finally, we can check how close our restoration is to the original dataset, as follows:
print(head(restored-Boston))
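To see how the restoration error depends on the number of eigenvectors kept, the following sketch repeats the transform-and-restore steps on a small random matrix for every possible choice of n. The helper name reconstruct and the toy matrix are assumptions for illustration.

```r
set.seed(2)
toy <- matrix(rnorm(40), nrow = 10, ncol = 4)
vecs <- eigen(cov(toy))$vectors

# Project onto the first n eigenvectors and map back, mirroring the steps above
reconstruct <- function(m, vecs, n) {
  v <- vecs[, 1:n, drop = FALSE]
  transformed <- m %*% v   # same result as t(t(v) %*% t(m))
  transformed %*% t(v)     # same result as t(v %*% t(transformed))
}

# Overall (Frobenius) reconstruction error for n = 1, ..., 4
errs <- sapply(1:4, function(n) sqrt(sum((reconstruct(toy, vecs, n) - toy)^2)))
print(errs)
```

The error never increases as more eigenvectors are kept, and with all of them it drops to floating-point noise, because the full set of eigenvectors forms an orthonormal basis.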
Here, we need to specify a support threshold (for example, 20%), and complete the first pass through the data:
support_thresh<-0.2
firstpass<-unname(which(colMeans(Bostondummy,na.rm=TRUE)>support_thresh))
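The first pass works because the mean of a 0/1 column is the fraction of rows in which that item appears, which is the item's support. A minimal sketch on made-up basket data, using a 0.5 threshold chosen only for this example:

```r
# Toy basket data: each row is a transaction, each column a 0/1 item flag
toy <- data.frame(A = c(1, 1, 0, 1, 0),
                  B = c(1, 0, 0, 0, 0),
                  C = c(1, 1, 1, 0, 1))

toy_thresh <- 0.5
item_support <- colMeans(toy)   # A = 0.6, B = 0.2, C = 0.8
print(item_support)

# Column indices of the items whose support exceeds the threshold
frequent <- unname(which(item_support > toy_thresh))
print(frequent)                 # 1 and 3, that is, items A and C
```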
Here, we complete the second pass through the data:
secondcand<-t(combn(firstpass,2))
secondpass<-NULL
k<-1
while(k<=nrow(secondcand)){
  support<-mean(Bostondummy[,secondcand[k,1]]*Bostondummy[,secondcand[k,2]],na.rm=TRUE)
  if(support>support_thresh){
    secondpass<-rbind(secondpass,secondcand[k,])
  }
  k<-k+1
}
Here, we complete the third pass, and then calculate the support, confidence, and lift of each three-item basket that passed the support threshold:
thirdpass<-NULL
k<-1
while(k<=nrow(secondpass)){
  j<-1
  while(j<=length(firstpass)){
    # product of the two items in the k-th frequent pair
    n<-1
    product<-1
    while(n<=ncol(secondpass)){
      product<-product*Bostondummy[,secondpass[k,n]]
      n<-n+1
    }
    # extend the pair with a third item that is not already in it
    if(!(firstpass[j] %in% secondpass[k,])){
      product<-product*Bostondummy[,firstpass[j]]
      support<-mean(product,na.rm=TRUE)
      if(support>support_thresh){
        thirdpass<-rbind(thirdpass,c(secondpass[k,],firstpass[j]))
      }
    }
    j<-j+1
  }
  k<-k+1
}
thirdpass_conf<-NULL
k<-1
while(k<=nrow(thirdpass)){
  support<-mean(Bostondummy[,thirdpass[k,1]]*Bostondummy[,thirdpass[k,2]]*Bostondummy[,thirdpass[k,3]],na.rm=TRUE)
  confidence<-mean(Bostondummy[,thirdpass[k,1]]*Bostondummy[,thirdpass[k,2]]*Bostondummy[,thirdpass[k,3]],na.rm=TRUE)/mean(Bostondummy[,thirdpass[k,1]]*Bostondummy[,thirdpass[k,2]],na.rm=TRUE)
  lift<-confidence/mean(Bostondummy[,thirdpass[k,3]],na.rm=TRUE)
  thirdpass_conf<-rbind(thirdpass_conf,unname(c(thirdpass[k,],support,confidence,lift)))
  k<-k+1
}
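As a worked example of the confidence and lift formulas, the following sketch evaluates the rule {A, B} -> C on made-up data and then keeps only the rows of a results matrix that exceed a confidence and a lift threshold. The threshold names conf_thresh and lift_thresh, their values, and the second results row are assumptions for illustration.

```r
toy <- data.frame(A = c(1, 1, 0, 1, 0, 1),
                  B = c(1, 1, 0, 1, 0, 0),
                  C = c(1, 1, 1, 0, 0, 0))

# Rule {A, B} -> C
support    <- mean(toy$A * toy$B * toy$C)     # 2/6
confidence <- support / mean(toy$A * toy$B)   # (2/6) / (3/6) = 2/3
lift       <- confidence / mean(toy$C)        # (2/3) / (3/6) = 4/3

# Results laid out like thirdpass_conf: three items, support, confidence, lift
# (the second row is made up to show a rule being dropped)
results <- rbind(c(1, 2, 3, support, confidence, lift),
                 c(1, 3, 2, 0.30, 0.40, 0.90))

# Hypothetical thresholds, not taken from the text
conf_thresh <- 0.6
lift_thresh <- 1
keep <- results[, 5] > conf_thresh & results[, 6] > lift_thresh
kept <- results[keep, , drop = FALSE]
print(kept)
```

A lift above 1 means the third item appears more often alongside the first two than it does overall.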
Our final output is the list of three-item baskets that passed the support threshold, together with the support, confidence, and lift of each:
print(head(thirdpass_conf))
The output is as follows: