Applied Supervised Learning with R

Use machine learning libraries of R to build models that solve business problems and predict future trends

Product type: Paperback
Published: May 2019
ISBN-13: 9781838556334
Length: 502 pages
Edition: 1st Edition
Authors (2): Jojo Moolayil, Karthik Ramasubramanian
Table of Contents

Preface
1. R for Advanced Analytics
2. Exploratory Analysis of Data
3. Introduction to Supervised Learning
4. Regression
5. Classification
6. Feature Selection and Dimensionality Reduction
7. Model Improvements
8. Model Deployment
9. Capstone Project - Based on Research Papers
Appendix

Chapter 3: Introduction to Supervised Learning


Activity 5: Draw a Scatterplot between PRES and PM2.5 Split by Months

  1. Import the ggplot2 package into the system:

    library(ggplot2)
  2. In ggplot(), map the x aesthetic of the aes() method to the variable PRES, with pm2.5 on the y axis and hour as the point color:

    ggplot(data = PM25, aes(x = PRES, y = pm2.5, color = hour)) +
      geom_point()
  3. In the next layer, add geom_smooth(), passing colour = "blue" to differentiate the fitted curve from the points:

    geom_smooth(method = 'auto', formula = y ~ x, colour = "blue", size = 1)
  4. Finally, in the facet_wrap() layer, use the month variable to draw a separate segregation for each month.

    facet_wrap(~ month, nrow = 4)

    The final code will look like this:

    ggplot(data = PM25, aes(x = PRES, y = pm2.5, color = hour)) +
      geom_point() +
      geom_smooth(method = 'auto', formula = y ~ x, colour = "blue", size = 1) +
      facet_wrap(~ month, nrow = 4)

    The plot is as follows:

    Figure 3.19: Scatterplot showing the relationship between PRES and PM2.5
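
    Since the PM25 data frame is not reproduced here, the layered call can be tried on a small synthetic stand-in. The following is only a sketch: the toy data frame and all of its values are made up for illustration, and only the column names (PRES, pm2.5, hour, month) follow the activity. The size argument is left out to keep the sketch minimal.

```r
library(ggplot2)

# Synthetic stand-in for PM25 (assumption: values are invented, 20 rows per month)
set.seed(42)
n <- 240
toy <- data.frame(
  month = rep(1:12, each = n / 12),
  hour  = sample(0:23, n, replace = TRUE),
  PRES  = runif(n, 990, 1040)
)
# Fake pm2.5 readings loosely tied to pressure, clipped at zero
toy$pm2.5 <- pmax(0, 100 - 0.5 * (toy$PRES - 1015) + rnorm(n, sd = 30))

# Same layering as in the activity: points, a smoother, then one panel per month
p <- ggplot(data = toy, aes(x = PRES, y = pm2.5, color = hour)) +
  geom_point() +
  geom_smooth(method = "auto", formula = y ~ x, colour = "blue") +
  facet_wrap(~ month, nrow = 4)

# Number of geom layers added to the plot object
print(length(p$layers))
```

    Calling print(p) draws the figure; with the real PM25 data frame in place of toy, the same call should reproduce the layout of the plot above.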

Activity 6: Transforming Variables and Deriving New Variables to Build a Model

Perform the following steps for building the model:

  1. Import the required libraries and packages into the system:

    library(dplyr)
    library(lubridate)
    library(tidyr)
    library(ggplot2)
    library(grid)
    library(zoo)
  2. Combine the year, month, day, and hour into a datetime variable:

    PM25$datetime <- with(PM25, ymd_h(sprintf('%04d%02d%02d%02d', year, month, day, hour)))
  3. Keep only the datetime and pm2.5 columns and remove the rows with a missing value in either of them:

    PM25_subset <- na.omit(PM25[,c("datetime","pm2.5")])
  4. Use the rollapply() method from the zoo package to compute a moving average of PM2.5 over a window of three readings; this smooths out noise in the individual PM2.5 readings:

    PM25_three_hour_pm25_avg <- rollapply(zoo(PM25_subset$pm2.5, PM25_subset$datetime), 3, mean)
  5. Create two levels of PM2.5 pollution: 0 (Normal) for a moving average of 35 or below, and 1 (Above Normal) otherwise. More than two levels are possible; however, logistic regression in its standard form models a binary outcome, so we use two levels:

    PM25_three_hour_pm25_avg <- as.data.frame(PM25_three_hour_pm25_avg)
    PM25_three_hour_pm25_avg$timestamp <- row.names(PM25_three_hour_pm25_avg)
    row.names(PM25_three_hour_pm25_avg) <- NULL
    colnames(PM25_three_hour_pm25_avg) <- c("avg_pm25","timestamp")
    PM25_three_hour_pm25_avg$pollution_level <- ifelse(PM25_three_hour_pm25_avg$avg_pm25 <= 35, 0, 1)
    PM25_three_hour_pm25_avg$timestamp <- as.POSIXct(PM25_three_hour_pm25_avg$timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
  6. Merge the resulting data frame (PM25_three_hour_pm25_avg) with the values of the other environmental variables, such as TEMP, DEWP, and Iws, which we used in the linear regression model. The merge matches timestamp against datetime:

    PM25_for_class <- merge(PM25_three_hour_pm25_avg,
                            PM25[, c("datetime", "TEMP", "DEWP", "PRES", "Iws", "cbwd", "Is", "Ir")],
                            by.x = "timestamp", by.y = "datetime")
  7. Fit the generalized linear model (glm) on pollution_level using the TEMP, DEWP and Iws variables:

    PM25_logit_model <- glm(pollution_level ~ DEWP + TEMP + Iws, data = PM25_for_class, family = binomial(link = 'logit'))
  8. Summarize the model:

    summary(PM25_logit_model)

    The output is as follows:

    Call:
    glm(formula = pollution_level ~ DEWP + TEMP + Iws, family = binomial(link = "logit"), 
        data = PM25_for_class)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -2.4699  -0.5212   0.4569   0.6508   3.5824  
    
    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  2.5240276  0.0273353   92.34   <2e-16 ***
    DEWP         0.1231959  0.0016856   73.09   <2e-16 ***
    TEMP        -0.1028211  0.0018447  -55.74   <2e-16 ***
    Iws         -0.0127037  0.0003535  -35.94   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 49475  on 41754  degrees of freedom
    Residual deviance: 37821  on 41751  degrees of freedom
    AIC: 37829
    
    Number of Fisher Scoring iterations: 5
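
    The estimates in the summary are on the log-odds scale, so they can be easier to read after exponentiating. The sketch below copies the four estimates from the output above into a plain vector, converts the slopes to odds ratios, and computes a predicted probability for one reading. The reading values (DEWP = -5, TEMP = 10, Iws = 20) are hypothetical and chosen purely for illustration.

```r
# Coefficient estimates copied from the summary() output above
coefs <- c("(Intercept)" = 2.5240276,
           DEWP = 0.1231959,
           TEMP = -0.1028211,
           Iws  = -0.0127037)

# exp() turns a log-odds slope into an odds ratio per one-unit increase
odds_ratios <- exp(coefs[c("DEWP", "TEMP", "Iws")])
print(round(odds_ratios, 4))

# Hypothetical reading (assumption: values invented for illustration)
reading <- c(DEWP = -5, TEMP = 10, Iws = 20)

# Linear predictor (log-odds), then the logistic function via plogis()
log_odds <- coefs[["(Intercept)"]] + sum(coefs[names(reading)] * reading)
prob_above_normal <- plogis(log_odds)
print(prob_above_normal)
```

    An odds ratio above 1 (DEWP) means the odds of the Above Normal level rise as the variable increases, while a ratio below 1 (TEMP, Iws) means they fall. With the fitted model object available, exp(coef(PM25_logit_model)) and predict(PM25_logit_model, newdata = ..., type = "response") should give the same quantities without copying numbers by hand.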