You're reading from Practical Machine Learning with R Define, build, and evaluate machine learning models for real-world applications

Product type Paperback

Published in Aug 2019

Publisher Packt

ISBN-13 9781838550134

Length 416 pages

Edition 1st Edition

Authors (3):

Brindha Priyadarshini Jeyaraman

Ludvig Renbo Olsen

Monicah Wambugu

Regression

In this section, we will cover linear regression with single and multiple variables. Let's implement a linear regression model in R. We will predict the median value of an owner-occupied house in the Boston Housing dataset.

The Boston Housing dataset contains the following fields:

Figure 1.34: Boston Housing dataset fields

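First, build a simple linear regression model that predicts medv from the indus field (the proportion of non-retail business acres per town):

# Load the Boston Housing data (assumes the mlbench package is installed)
library(mlbench)
data(BostonHousing)

# Build a simple linear regression
model1 <- lm(medv ~ indus, data = BostonHousing)
# summary(model1)
AIC(model1)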

The output is as follows:

[1] 3551.601

Next, build a multiple linear regression model that uses the age and dis fields:

model2 <- lm(medv ~ age + dis, data = BostonHousing)
summary(model2)
AIC(model2)

The output of summary(model2) is as follows:

Call:
lm(formula = medv ~ age + dis, data = BostonHousing)

Residuals:
    Min      1Q  Median      3Q     Max
-15.661  -5.145  -1.900   2.173  31.114

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  33.3982     2.2991  14.526  < 2e-16 ***
age          -0.1409     0.0203  -6.941  1.2e-11 ***
dis          -0.3170     0.2714  -1.168    0.243
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.524 on 503 degrees of freedom
Multiple R-squared:  0.1444, Adjusted R-squared:  0.141
F-statistic: 42.45 on 2 and 503 DF,  p-value: < 2.2e-16

The output of AIC(model2) is as follows:

[1] 3609.558

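AIC is the Akaike information criterion, which estimates the relative quality of models fitted to the same data by rewarding goodness of fit and penalizing the number of parameters; the lower the value, the better the model. Here, model1 (AIC 3551.601) is therefore preferred over model2 (AIC 3609.558).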
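As a minimal sketch of what this criterion measures, the value reported by AIC() can be reproduced from a model's log-likelihood as -2 * logLik + 2 * k, where k is the number of estimated parameters (the coefficients plus the residual variance). The example assumes the mlbench package is installed and otherwise falls back to the Boston data frame in MASS, which is assumed here to carry the same medv, indus, age, and dis columns; the helper name manual_aic is hypothetical.

```r
# Load the Boston Housing data; assumes mlbench is installed and otherwise
# falls back to MASS::Boston (assumed to hold the same medv, indus, age, dis).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("BostonHousing", package = "mlbench")
} else {
  BostonHousing <- MASS::Boston
}

model1 <- lm(medv ~ indus, data = BostonHousing)
model2 <- lm(medv ~ age + dis, data = BostonHousing)

# Hypothetical helper: AIC = -2 * log-likelihood + 2 * (number of parameters)
manual_aic <- function(model) {
  ll <- logLik(model)
  -2 * as.numeric(ll) + 2 * attr(ll, "df")
}

manual_aic(model1)  # same value as AIC(model1)
manual_aic(model2)  # same value as AIC(model2)

# Passing both models to AIC() returns a data frame for side-by-side comparison
AIC(model1, model2)
```

Because the penalty term grows with the number of parameters, a model with more predictors only earns a lower AIC if its fit improves enough to offset that penalty.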

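In linear regression, it is important to measure the distance between the actual output values and the predicted values. A common measure is the root mean squared error (RMSE): the square root of the mean of the squared errors, computed as sqrt(sum(error^2)/n), where error is the difference between the actual and predicted values and n is the number of observations.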
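The RMSE calculation can be sketched for the single-variable model as follows. This is an illustration on the training data only, assuming the mlbench package is installed (with a fallback to the Boston data frame in MASS, assumed to hold the same medv and indus columns); the variable names error, n, and rmse are chosen here for illustration.

```r
# Load the Boston Housing data; assumes mlbench is installed and otherwise
# falls back to MASS::Boston (assumed to hold the same medv and indus columns).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("BostonHousing", package = "mlbench")
} else {
  BostonHousing <- MASS::Boston
}

model1 <- lm(medv ~ indus, data = BostonHousing)

# Error = actual value minus predicted value, here on the training data
error <- BostonHousing$medv - predict(model1, BostonHousing)
n <- length(error)

# Root mean squared error: square root of the mean of the squared errors
rmse <- sqrt(sum(error^2) / n)
rmse
```

The result is expressed in the same units as medv, so it can be read as a typical prediction error in those units.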

We have learned to build various regression models with single or multiple fields in the preceding example.

Another type of supervised learning is classification. In the next exercise, we will build a simple linear classifier to see how similar that process is to fitting linear regression models. After that, you will dive into building more regression models in the activities.

Exercise 5: Building a Linear Classifier in R

In this exercise, we will build a linear classifier for the GermanCredit dataset using a linear discriminant analysis model.

The German Credit dataset contains the credit-worthiness of a customer (whether the customer is 'good' or 'bad' based on their credit history), account details, and so on. The dataset can be found at https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/GermanCredit.csv.

Load the dataset:
# load the packages: caret provides the GermanCredit data, MASS provides lda()
library(caret)
library(MASS)
data(GermanCredit)
# OR
# GermanCredit <- read.csv("GermanCredit.csv")
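Subset the dataset:
# Subset the data: keep the first 10 columns (in caret's version of the data, nine predictors and the Class label)
GermanCredit_Subset <- GermanCredit[, 1:10]
Fit the model:
# fit model
fit <- lda(Class ~ ., data = GermanCredit_Subset)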
Summarize the fit:
# summarize the fit
summary(fit)
The output is as follows:
        Length Class  Mode
prior    2     -none- numeric
counts   2     -none- numeric
means   18     -none- numeric
scaling  9     -none- numeric
lev      2     -none- character
svd      1     -none- numeric
N        1     -none- numeric
call     3     -none- call
terms    3     terms  call
xlevels  0     -none- list
Make predictions:
# make predictions on the training data
predictions <- predict(fit, GermanCredit_Subset)$class
Calculate the accuracy of the model:
# summarize accuracy
accuracy <- mean(predictions == GermanCredit_Subset$Class)
Print accuracy:
accuracy
The output is as follows:
[1] 0.71

In this exercise, we have trained a linear classifier to predict the credit rating of customers with an accuracy of 71%. Note that this accuracy was measured on the same data that was used to train the model. In Chapter 4, Introduction to neuralnet and Evaluation Methods, we will try to beat that accuracy and investigate whether 71% is actually a good accuracy for the given dataset.
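One simple way to put an accuracy figure in context, offered here only as a rough sketch and not as the evaluation method used later in the book, is to compare it with the accuracy of always predicting the most frequent class, and to inspect a table of predicted against actual classes. The helper name baseline_accuracy is hypothetical, and the demonstration is guarded so that it runs only when the caret and MASS packages are installed.

```r
# Hypothetical helper: accuracy obtained by always predicting the most frequent class
baseline_accuracy <- function(actual) {
  max(table(actual)) / length(actual)
}

# Guarded so the sketch only runs when caret (data) and MASS (lda) are available
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("MASS", quietly = TRUE)) {
  data("GermanCredit", package = "caret")
  GermanCredit_Subset <- GermanCredit[, 1:10]

  fit <- MASS::lda(Class ~ ., data = GermanCredit_Subset)
  predictions <- predict(fit, GermanCredit_Subset)$class

  # Rows are predicted classes, columns are actual classes
  print(table(Predicted = predictions, Actual = GermanCredit_Subset$Class))

  # Compare the model's training accuracy with the majority-class baseline
  print(mean(predictions == GermanCredit_Subset$Class))
  print(baseline_accuracy(GermanCredit_Subset$Class))
}
```

If the two printed accuracies are close, the classifier is adding little beyond what the class proportions alone would give.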