Chapter 3: Feature Engineering
Activity 10: Calculating Time Series Feature – Binning
Solution:
- Load the caret library:
#Time series features
#Install caret if not installed
#install.packages('caret')
library(caret)
- Load the GermanCredit dataset:
GermanCredit <- read.csv("GermanCredit.csv")
duration <- GermanCredit$Duration # take the Duration column
- Check the data summary as follows:
summary(duration)
The output is as follows:
Figure 3.27: The summary of the Duration values of the GermanCredit dataset
- Load the ggplot2 library:
library(ggplot2)
- Plot using the command:
ggplot(data=GermanCredit, aes(x=Duration)) +
  geom_density(fill='lightblue') +
  geom_rug() +
  labs(x='Duration')
The output is as follows:
Figure 3.28: Plot of the duration vs density
- Create bins:
#Creating Bins
# set up boundaries for intervals/bins
breaks <- c(0,10,20,30,40,50,60,70,80)
- Create labels:
# specify interval/bin labels
labels <- c("<10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80")
- Bucket the data points into the bins:
# bucketing data points into bins
bins <- cut(duration, breaks, include.lowest = TRUE, right = FALSE, labels = labels)
- Find the number of elements in each bin:
# inspect bins
summary(bins)
The output is as follows:
  <10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
  143   403   241   131    66     2    13     1
- Plot the bins:
#Plotting the bins
plot(bins, main="Frequency of Duration", ylab="Duration Count", xlab="Duration Bins", col="bisque")
The output is as follows:
Figure 3.29: Plot of duration in bins
We can conclude that the largest number of customers falls within the 10 to 20 duration range.
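Because the goal of this chapter is feature engineering, a natural follow-up (not part of the original activity) is to store the bins as a new categorical column; the column name DurationBin is our own choice:
# Attach the binned duration to the dataset as a categorical feature
GermanCredit$DurationBin <- bins
# Confirm the new column's distribution
table(GermanCredit$DurationBin)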
Activity 11: Identifying Skewness
Solution:
- Load the mlbench and e1071 libraries:
#Skewness
library(mlbench)
library(e1071)
- Load the PimaIndiansDiabetes data:
PimaIndiansDiabetes = read.csv("PimaIndiansDiabetes.csv")
- Print the skewness of the glucose column, using the skewness() function.
#Printing the skewness of the columns
#Not skewed
skewness(PimaIndiansDiabetes$glucose)
The output is as follows:
[1] 0.1730754
- Plot the histogram using the histogram() function.
library(lattice) # histogram() comes from the lattice package
histogram(PimaIndiansDiabetes$glucose)
The output is as follows:
Figure 3.30: Histogram of the glucose values of the PimaIndiansDiabetes dataset
A negative skewness value means that the data is skewed to the left, and a positive skewness value means that it is skewed to the right. Since the value here is only 0.17, the distribution is close to symmetric, so we treat the glucose column as not skewed.
- Find the skewness of the age column using the skewness() function.
#Highly skewed
skewness(PimaIndiansDiabetes$age)
The output is as follows:
[1] 1.125188
- Plot the histogram using the histogram() function.
histogram(PimaIndiansDiabetes$age)
The output is as follows:
Figure 3.31: Histogram of the age values of the PimaIndiansDiabetes dataset
The positive skewness value of about 1.13 means that the age column is skewed to the right, as the histogram above confirms.
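As an optional follow-up that is not part of the original activity, right-skewed columns such as age are often log-transformed before modeling. The short sketch below recomputes the skewness on the transformed values, which should come out much closer to zero:
# Optional: reduce the right skew with a log transform
log_age <- log(PimaIndiansDiabetes$age)
skewness(log_age)
histogram(log_age)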
Activity 12: Generating PCA
Solution:
- Load the GermanCredit data (this copy is bundled with the caret package):
#PCA Analysis
data(GermanCredit)
- Create a subset of the first nine columns in another variable named GermanCredit_subset:
#Use the German Credit Data
GermanCredit_subset <- GermanCredit[,1:9]
- Find the principal components:
#Find out the Principal components
principal_components <- prcomp(x = GermanCredit_subset, scale. = T)
- Print the principal components:
#Print the principal components
print(principal_components)
The output is as follows:
Standard deviations (1, .., p=9):
[1] 1.3505916 1.2008442 1.1084157 0.9721503 0.9459586 0.9317018 0.9106746 0.8345178 0.5211137
Rotation (n x k) = (9 x 9):
Figure 3.32: The rotation matrix of the principal components
Therefore, by using principal component analysis we obtain nine principal components from the nine fields in the subset. Each component is a linear combination of the original fields, and the components can be used as features in their own right.
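To use these components as features, we can check how much variance each one explains and extract the component scores stored in the prcomp result. A minimal sketch follows; the names pca_features and GermanCredit_pca, and the choice of keeping five components, are our own:
# Proportion of variance explained by each component
summary(principal_components)
# Component scores: one row per observation, one column per component
pca_features <- principal_components$x
# For example, keep the first five components as new features alongside the target
GermanCredit_pca <- data.frame(pca_features[, 1:5], Class = GermanCredit$Class)
head(GermanCredit_pca)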
Activity 13: Implementing the Random Forest Approach
Solution:
- Load the GermanCredit data:
data(GermanCredit)
- Create a subset of the first ten columns and store it in GermanCredit_subset:
GermanCredit_subset <- GermanCredit[,1:10]
- Attach the randomForest package:
library(randomForest)
- Train a random forest model using random_forest = randomForest(Class ~ ., data=GermanCredit_subset):
random_forest = randomForest(Class ~ ., data=GermanCredit_subset)
- Invoke importance() for the trained random_forest:
# Create an importance based on mean decreasing gini
importance(random_forest)
The output is as follows:
                          MeanDecreaseGini
Duration                         70.380265
Amount                          121.458790
InstallmentRatePercentage        27.048517
ResidenceDuration                30.409254
Age                              86.476017
NumberExistingCredits            18.746057
NumberPeopleMaintenance          12.026969
Telephone                        15.581802
ForeignWorker                     2.888387
- Use the varImp() function to view the list of important variables.
varImp(random_forest)
The output is as follows:
                             Overall
Duration                   70.380265
Amount                    121.458790
InstallmentRatePercentage  27.048517
ResidenceDuration          30.409254
Age                        86.476017
NumberExistingCredits      18.746057
NumberPeopleMaintenance    12.026969
Telephone                  15.581802
ForeignWorker               2.888387
In this activity, we built a random forest model and used it to score the importance of each variable in the dataset. Variables with higher scores are considered more important. From here, we can sort by importance and choose the top 5 or top 10 variables for the model, or set an importance threshold and keep every variable that meets it, as sketched below.
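A minimal sketch of both strategies, assuming the random_forest model trained above (the top-5 cut-off and the threshold of 30 are arbitrary example values):
# Rank the variables by MeanDecreaseGini
imp <- importance(random_forest)
ranked <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
# Option 1: keep the top 5 variables plus the target
top5 <- rownames(ranked)[1:5]
GermanCredit_top5 <- GermanCredit_subset[, c(top5, "Class")]
# Option 2: keep every variable whose score exceeds an example threshold of 30
selected <- rownames(imp)[imp[, "MeanDecreaseGini"] > 30]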
Activity 14: Selecting Features Using Variable Importance
Solution:
- Install and load the following packages, and set the seed:
install.packages("rpart")
library(rpart)
library(caret)
set.seed(10)
- Load the GermanCredit dataset:
data(GermanCredit)
- Create a subset of the first ten columns and store it in GermanCredit_subset:
GermanCredit_subset <- GermanCredit[,1:10]
- Train an rpart model using rPartMod <- train(Class ~ ., data=GermanCredit_subset, method="rpart"):
#Train a rpart model
rPartMod <- train(Class ~ ., data=GermanCredit_subset, method="rpart")
- Invoke the varImp() function, as in rpartImp <- varImp(rPartMod).
#Find variable importance
rpartImp <- varImp(rPartMod)
- Print rpartImp.
#Print variable importance
print(rpartImp)
The output is as follows:
rpart variable importance
                          Overall
Amount                    100.000
Duration                   89.670
Age                        75.229
ForeignWorker              22.055
InstallmentRatePercentage  17.288
Telephone                   7.813
ResidenceDuration           4.471
NumberExistingCredits       0.000
NumberPeopleMaintenance     0.000
- Plot the top five variables in rpartImp using plot():
#Plot top 5 variable importance
plot(rpartImp, top = 5, main='Variable Importance')
The output is as follows:
Figure 3.33: Variable importance for the fields
From the preceding plot, we can observe that Amount, Duration, and Age have high importance values.
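To carry this selection forward programmatically, the scores are stored in the importance element of the varImp object. A small sketch, where keeping three variables and the name GermanCredit_reduced are our own choices:
# Pull the scores out of the varImp object and pick the top three variables
imp_df <- rpartImp$importance
top_vars <- rownames(imp_df)[order(imp_df$Overall, decreasing = TRUE)][1:3]
# Build a reduced dataset containing only these variables and the target
GermanCredit_reduced <- GermanCredit_subset[, c(top_vars, "Class")]
str(GermanCredit_reduced)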