You're reading from Applied Supervised Learning with R Use machine learning libraries of R to build models that solve business problems and predict future trends

Product type Paperback

Published in May 2019

Publisher

ISBN-13 9781838556334

Length 502 pages

Edition 1st Edition

Languages

Concepts

Machine Learning

Authors (2):

Jojo Moolayil

Karthik Ramasubramanian

Table of Contents (12) Chapters

Applied Supervised Learning with R

Preface

1. R for Advanced Analytics FREE CHAPTER

2. Exploratory Analysis of Data

3. Introduction to Supervised Learning

4. Regression

5. Classification

6. Feature Selection and Dimensionality Reduction

7. Model Improvements

8. Model Deployment

9. Capstone Project - Based on Research Papers

Appendix

Chapter 3: Introduction to Supervised Learning

Activity 5: Draw a Scatterplot between PRES and PM2.5 Split by Months

Import the ggplot2 package into the system:
```
library(ggplot2)
```

In ggplot(), map the x component of the aes() method to the variable PRES and the y component to pm2.5, then add a geom_point() layer:

```
ggplot(data = PM25, aes(x = PRES, y = pm2.5, color = hour)) +
  geom_point()
```

In the next layer, add geom_smooth(), passing colour = "blue" to differentiate the fitted line from the points:
```
geom_smooth(method = "auto", formula = y ~ x, colour = "blue", size = 1)
```
Finally, in the facet_wrap() layer, use the month variable to draw a separate panel for each month:
```
facet_wrap(~ month, nrow = 4)
```
The final code will look like this:
```
ggplot(data = PM25, aes(x = PRES, y = pm2.5, color = hour)) +
  geom_point() +
  geom_smooth(method = "auto", formula = y ~ x, colour = "blue", size = 1) +
  facet_wrap(~ month, nrow = 4)
```
The plot is as follows:
Figure 3.19: Scatterplot showing the relationship between PRES and PM2.5
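
The layered construction can be tried without the original dataset. The following is a minimal sketch on a synthetic data frame: `toy` and all of its values are invented for illustration, and only the column names mirror those of PM25.

```r
library(ggplot2)

# Synthetic stand-in for PM25 (values are made up for illustration only)
set.seed(42)
toy <- data.frame(
  month = rep(1:4, each = 50),
  hour  = sample(0:23, 200, replace = TRUE),
  PRES  = rnorm(200, mean = 1016, sd = 10)
)
toy$pm2.5 <- 100 - 0.5 * (toy$PRES - 1016) + rnorm(200, sd = 20)

# Build the plot one layer at a time, mirroring the steps above
p <- ggplot(data = toy, aes(x = PRES, y = pm2.5, color = hour)) +
  geom_point()
p <- p + geom_smooth(method = "auto", formula = y ~ x, colour = "blue")
p <- p + facet_wrap(~ month, nrow = 2)

print(p)
```

Storing the plot in an object such as `p` makes it easy to inspect the effect of each layer before adding the next one.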

Activity 6: Transforming Variables and Deriving New Variables to Build a Model

Perform the following steps for building the model:

Import the required libraries and packages into the system:

```
library(dplyr)
library(lubridate)
library(tidyr)
library(ggplot2)
library(grid)
library(zoo)
```

Combine the year, month, day, and hour into a datetime variable:

```
PM25$datetime <- with(PM25, ymd_h(sprintf('%04d%02d%02d%02d', year, month, day, hour)))
```
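
To see what the format string produces, the sketch below applies the same zero-padding to a single hypothetical reading and then parses the result. It uses base R's as.POSIXct() as a stand-in for ymd_h(), so it illustrates the idea rather than reproducing the lubridate call; the date parts are invented.

```r
# Hypothetical date parts for a single reading
year <- 2010L; month <- 1L; day <- 2L; hour <- 5L

# Zero-pad each part into a fixed-width string: YYYYMMDDHH
stamp <- sprintf("%04d%02d%02d%02d", year, month, day, hour)
print(stamp)  # "2010010205"

# Base R analogue of parsing that string as year-month-day-hour in UTC
parsed <- as.POSIXct(stamp, format = "%Y%m%d%H", tz = "UTC")
print(format(parsed, "%Y-%m-%d %H:%M:%S"))  # "2010-01-02 05:00:00"
```

The fixed widths matter: without the padding, a string such as "201011205" would be ambiguous between different month and day combinations.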

Keep only the datetime and pm2.5 columns, and remove the rows with a missing value in either of them:

```
PM25_subset <- na.omit(PM25[, c("datetime", "pm2.5")])
```
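
Selecting the columns before calling na.omit() changes which rows survive. The toy example below, with invented values, sketches the difference between the two orders of operation.

```r
# Toy readings with gaps (values invented for illustration)
toy <- data.frame(
  datetime = as.POSIXct("2010-01-01 00:00:00", tz = "UTC") + 3600 * 0:3,
  pm2.5    = c(NA, 129, 148, NA),
  TEMP     = c(-11, NA, -11, -14)
)

# Subsetting first: only NAs in datetime or pm2.5 cause a row to be dropped
subset_clean <- na.omit(toy[, c("datetime", "pm2.5")])
nrow(subset_clean)   # 2 rows survive (rows 2 and 3)

# Applying na.omit() to every column also drops row 2, where TEMP is NA
nrow(na.omit(toy))   # 1 row survives (row 3)
```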

Use the rollapply() method from the zoo package to compute a moving average of PM2.5 over a window of three readings; this smooths out noise in individual PM2.5 readings:
```
PM25_three_hour_pm25_avg <- rollapply(zoo(PM25_subset$pm2.5,PM25_subset$datetime), 3, mean)
```
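
The behavior of rollapply() at the edges of the series is easiest to see on a short input. The sketch below uses five invented hourly readings and relies on the function's default alignment.

```r
library(zoo)

# Five hourly readings (values invented for illustration)
times    <- as.POSIXct("2010-01-02 00:00:00", tz = "UTC") + 3600 * 0:4
readings <- zoo(c(10, 20, 30, 40, 50), order.by = times)

# Width-3 rolling mean; by default the window is centred and edges are dropped
smoothed <- rollapply(readings, 3, mean)

length(readings)    # 5
length(smoothed)    # 3: the first and last timestamps have no full window
coredata(smoothed)  # 20 30 40
```

Because incomplete windows are dropped, the smoothed series is two observations shorter than its input.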

Convert the result to a data frame and create two levels of PM2.5 pollution: 0 (Normal) for an average of 35 or less, and 1 (Above Normal) otherwise. We could create more than two levels; however, logistic regression is best suited to binary classification, so we use two:

```
PM25_three_hour_pm25_avg <- as.data.frame(PM25_three_hour_pm25_avg)
PM25_three_hour_pm25_avg$timestamp <- row.names(PM25_three_hour_pm25_avg)
row.names(PM25_three_hour_pm25_avg) <- NULL
colnames(PM25_three_hour_pm25_avg) <- c("avg_pm25", "timestamp")
PM25_three_hour_pm25_avg$pollution_level <- ifelse(PM25_three_hour_pm25_avg$avg_pm25 <= 35, 0, 1)
PM25_three_hour_pm25_avg$timestamp <- as.POSIXct(PM25_three_hour_pm25_avg$timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
```
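
The labelling rule can be checked in isolation. The following sketch applies the same ifelse() condition to a few hypothetical averages, including one exactly at the cutoff, to show which side of the boundary it falls on.

```r
# Hypothetical 3-hour averages, including a value exactly at the cutoff
avg_pm25 <- c(12.0, 35.0, 35.1, 80.3)

# Values at or below 35 are labelled 0 (Normal); anything above is 1 (Above Normal)
pollution_level <- ifelse(avg_pm25 <= 35, 0, 1)
pollution_level   # 0 0 1 1

# Check the class balance before modelling
table(pollution_level)
```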

Merge the resulting data frame (PM25_three_hour_pm25_avg) with the values of the other environmental variables, such as TEMP, DEWP, and Iws, which we used in the linear regression model. The join matches timestamp in the first data frame to datetime in PM25:
```
PM25_for_class <- merge(PM25_three_hour_pm25_avg, PM25[,c("datetime","TEMP","DEWP","PRES","Iws","cbwd","Is","Ir")], by.x = "timestamp",by.y = "datetime")
```
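
The by.x and by.y arguments are what allow the two key columns to have different names. A minimal sketch with two invented data frames (`avg` and `weather` are hypothetical names) shows the default inner-join behavior.

```r
t0 <- as.POSIXct("2010-01-02 00:00:00", tz = "GMT")

# Left table keyed by `timestamp` (values invented for illustration)
avg <- data.frame(timestamp = t0 + 3600 * 1:3, avg_pm25 = c(140, 160, 170))

# Right table keyed by `datetime`, covering a different set of hours
weather <- data.frame(datetime = t0 + 3600 * 2:4, TEMP = c(-5, -5, -4))

# by.x / by.y pair up differently named key columns; the default is an inner join
joined <- merge(avg, weather, by.x = "timestamp", by.y = "datetime")
nrow(joined)    # 2: only the hours present in both tables
names(joined)   # "timestamp" "avg_pm25" "TEMP"
```

Rows without a match on either side are dropped, so it is worth comparing row counts before and after a merge.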

Fit the generalized linear model (glm) on pollution_level using the TEMP, DEWP and Iws variables:

```
PM25_logit_model <- glm(pollution_level ~ DEWP + TEMP + Iws, data = PM25_for_class, family = binomial(link = 'logit'))
```

Summarize the model:

```
summary(PM25_logit_model)
```

The output is as follows:

```
Call:
glm(formula = pollution_level ~ DEWP + TEMP + Iws, family = binomial(link = "logit"), 
    data = PM25_for_class)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4699  -0.5212   0.4569   0.6508   3.5824  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.5240276  0.0273353   92.34   <2e-16 ***
DEWP         0.1231959  0.0016856   73.09   <2e-16 ***
TEMP        -0.1028211  0.0018447  -55.74   <2e-16 ***
Iws         -0.0127037  0.0003535  -35.94   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 49475  on 41754  degrees of freedom
Residual deviance: 37821  on 41751  degrees of freedom
AIC: 37829

Number of Fisher Scoring iterations: 5
```
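
The coefficients in this output are on the log-odds scale, which can be hard to read directly. As a rough sketch, the snippet below copies the estimates and deviances from the summary, converts them to odds ratios, and computes the predicted probability for one hypothetical hour; the input values DEWP = -10, TEMP = 0, and Iws = 5 are made up for illustration.

```r
# Coefficients copied from the summary output above
coefs <- c("(Intercept)" = 2.5240276, DEWP = 0.1231959,
           TEMP = -0.1028211, Iws = -0.0127037)

# Exponentiating a log-odds coefficient gives the odds ratio per unit increase
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Predicted probability of "Above Normal" for one hypothetical hour
# (DEWP = -10, TEMP = 0, Iws = 5 are made-up inputs)
new_obs    <- c(1, -10, 0, 5)    # leading 1 multiplies the intercept
log_odds   <- sum(coefs * new_obs)
prob_above <- plogis(log_odds)   # inverse logit
print(prob_above)

# Share of the null deviance accounted for by the three predictors
print(1 - 37821 / 49475)
```

An odds ratio above 1 (DEWP) means the odds of an above-normal reading rise as the variable increases, while ratios below 1 (TEMP and Iws) mean the odds fall.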