Chapter 3: Introduction to Supervised Learning
Activity 5: Draw a Scatterplot between PRES and PM2.5 Split by Months
Import the ggplot2 package into the system:
library(ggplot2)
In ggplot(), map the x component of the aes() method to the variable PRES:
ggplot(data = PM25, aes(x = PRES, y = pm2.5, color = hour)) + geom_point()
In the next layer, geom_smooth(), pass colour = "blue" to differentiate the fitted line from the points:
geom_smooth(method='auto',formula=y~x, colour = "blue", size =1)
Finally, in the facet_wrap() layer, use the month variable to draw a separate panel for each month:
facet_wrap(~ month, nrow = 4)
The final code will look like this:
ggplot(data = PM25, aes(x = PRES, y = pm2.5, color = hour)) +
  geom_point() +
  geom_smooth(method = 'auto', formula = y ~ x, colour = "blue", size = 1) +
  facet_wrap(~ month, nrow = 4)
The plot is as follows:
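For readers who do not have the PM25 data frame loaded, the same three layers can be tried on simulated data. The sketch below assumes only that ggplot2 is installed. The data frame demo and all of its values are made up for illustration, and method = "lm" is swapped in for 'auto' to keep the fit simple on a small sample.

```r
library(ggplot2)

set.seed(42)
# Hypothetical stand-in for PM25: 4 months x 50 readings each
demo <- data.frame(
  month = rep(1:4, each = 50),
  hour  = sample(0:23, 200, replace = TRUE),
  PRES  = rnorm(200, mean = 1016, sd = 10)
)
# Made-up relationship between pressure and PM2.5, floored at zero
demo$pm2.5 <- pmax(0, 100 - 0.5 * (demo$PRES - 1016) + rnorm(200, sd = 30))

p <- ggplot(data = demo, aes(x = PRES, y = pm2.5, color = hour)) +
  geom_point() +                                        # layer 1: scatter
  geom_smooth(method = "lm", formula = y ~ x,
              colour = "blue") +                        # layer 2: fitted line
  facet_wrap(~ month, nrow = 2)                         # one panel per month
print(p)
```

Each call joined with + adds one layer or facet specification, so a layer can be commented out to see its individual contribution.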
Activity 6: Transforming Variables and Deriving New Variables to Build a Model
Perform the following steps for building the model:
Import the required libraries and packages into the system:
library(dplyr)
library(lubridate)
library(tidyr)
library(ggplot2)
library(grid)
library(zoo)
Combine the year, month, day, and hour into a datetime variable:
PM25$datetime <- with(PM25, ymd_h(sprintf('%04d%02d%02d%02d', year, month, day,hour)))
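The sprintf() call zero-pads each part so that every stamp has the fixed layout YYYYMMDDHH, which ymd_h() can then parse. A small sketch of this step, assuming lubridate is installed; the two rows in demo are hypothetical and only mimic the year, month, day, and hour columns of PM25:

```r
library(lubridate)

# Hypothetical rows mimicking the date-part columns of PM25
demo <- data.frame(year = c(2010L, 2010L), month = c(1L, 12L),
                   day = c(2L, 31L), hour = c(0L, 23L))

# Zero-pad each part: 4 digits for the year, 2 for everything else
stamp <- with(demo, sprintf("%04d%02d%02d%02d", year, month, day, hour))

# Parse the fixed-width stamps into date-times (UTC by default)
demo$datetime <- ymd_h(stamp)
print(demo)
```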
Remove the rows in which datetime or pm2.5 is missing:
PM25_subset <- na.omit(PM25[,c("datetime","pm2.5")])
Use the rollapply() method from the zoo package to compute a moving average of PM2.5 over a window of three readings; this smooths out noise in the individual PM2.5 readings:
PM25_three_hour_pm25_avg <- rollapply(zoo(PM25_subset$pm2.5,PM25_subset$datetime), 3, mean)
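To see what the window of width 3 does, the following sketch applies the same call to five made-up readings. It assumes zoo is installed; the integer index 1 to 5 is a hypothetical stand-in for the datetime index.

```r
library(zoo)

# Hypothetical readings indexed 1..5 (stand-ins for datetimes)
readings <- zoo(c(10, 20, 30, 40, 50), order.by = 1:5)

# Width-3 window, centred by default: each value averages a reading
# with its two neighbours, so the first and last index are dropped
avg3 <- rollapply(readings, 3, mean)
print(avg3)
```

Because the window is centred, the result is two observations shorter than the input, which is why the averaged series has slightly fewer rows than PM25_subset.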
Create two levels of PM2.5 pollution: 0 (Normal) and 1 (Above Normal), using an average of 35 as the cutoff. More than two levels are possible; however, logistic regression models a binary outcome, so we use two levels:
PM25_three_hour_pm25_avg <- as.data.frame(PM25_three_hour_pm25_avg)
PM25_three_hour_pm25_avg$timestamp <- row.names(PM25_three_hour_pm25_avg)
row.names(PM25_three_hour_pm25_avg) <- NULL
colnames(PM25_three_hour_pm25_avg) <- c("avg_pm25","timestamp")
PM25_three_hour_pm25_avg$pollution_level <- ifelse(PM25_three_hour_pm25_avg$avg_pm25 <= 35, 0,1)
PM25_three_hour_pm25_avg$timestamp <- as.POSIXct(PM25_three_hour_pm25_avg$timestamp, format= "%Y-%m-%d %H:%M:%S",tz="GMT")
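The labelling rule in this step is a single vectorized comparison. A minimal sketch in base R, using made-up averages, shows how values on either side of the cutoff are coded:

```r
# Hypothetical moving averages, including one exactly at the cutoff
avg_pm25 <- c(12.0, 35.0, 35.1, 180.5)

# 0 = Normal (at or below 35), 1 = Above Normal, as in the step above
pollution_level <- ifelse(avg_pm25 <= 35, 0, 1)
print(data.frame(avg_pm25, pollution_level))
```

Note that a value of exactly 35 falls in the Normal class because the comparison uses <=.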
Merge the resulting data frame (PM25_three_hour_pm25_avg) with the values of other environmental variables such as TEMP, DEWP, and Iws, which we used in the linear regression model:
PM25_for_class <- merge(PM25_three_hour_pm25_avg, PM25[,c("datetime","TEMP","DEWP","PRES","Iws","cbwd","Is","Ir")], by.x = "timestamp",by.y = "datetime")
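The by.x and by.y arguments let merge() join two data frames whose key columns have different names. The sketch below uses two small hypothetical frames (left and right, with numeric keys standing in for timestamps) to show that only keys present in both survive the default inner join:

```r
# Hypothetical frames; the key is called timestamp in one, datetime in the other
left  <- data.frame(timestamp = c(1, 2, 3), avg_pm25 = c(30, 40, 50))
right <- data.frame(datetime  = c(2, 3, 4), TEMP     = c(-4, -5, -6))

# Inner join by default: keys 1 and 4 have no partner and are dropped
joined <- merge(left, right, by.x = "timestamp", by.y = "datetime")
print(joined)
```

The key column in the result takes its name from by.x, which is why the merged frame above keeps timestamp rather than datetime.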
Fit the generalized linear model (glm) on pollution_level using the TEMP, DEWP and Iws variables:
PM25_logit_model <- glm(pollution_level ~ DEWP + TEMP + Iws, data = PM25_for_class,family=binomial(link='logit'))
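The same glm() pattern can be tried without the pollution data. This is a minimal sketch on simulated data, using only base R; the variables x and y, the sample size, and the true coefficients 0.5 and 1.5 are all assumptions made for illustration.

```r
set.seed(1)
n <- 500

# Hypothetical predictor and a binary outcome whose log-odds rise with x
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * x))
demo <- data.frame(x = x, y = y)

# Same family and link as the model above, with a single predictor
fit <- glm(y ~ x, data = demo, family = binomial(link = "logit"))
print(coef(fit))  # estimates are on the log-odds scale
```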
Summarize the model:
summary(PM25_logit_model)
The output is as follows:
Call:
glm(formula = pollution_level ~ DEWP + TEMP + Iws, family = binomial(link = "logit"),
    data = PM25_for_class)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4699  -0.5212   0.4569   0.6508   3.5824

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.5240276  0.0273353   92.34   <2e-16 ***
DEWP         0.1231959  0.0016856   73.09   <2e-16 ***
TEMP        -0.1028211  0.0018447  -55.74   <2e-16 ***
Iws         -0.0127037  0.0003535  -35.94   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 49475  on 41754  degrees of freedom
Residual deviance: 37821  on 41751  degrees of freedom
AIC: 37829

Number of Fisher Scoring iterations: 5
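The estimates in the summary are on the log-odds scale. As a rough illustration of how they combine, the sketch below computes the linear predictor for one set of conditions and passes it through the inverse logit, plogis(). The coefficient values are copied from the output above; the conditions in obs are hypothetical and chosen only for the example.

```r
# Estimates copied from the summary output
b <- c(intercept = 2.5240276, DEWP = 0.1231959,
       TEMP = -0.1028211, Iws = -0.0127037)

# Hypothetical conditions, chosen only for illustration
obs <- c(DEWP = -10, TEMP = 0, Iws = 10)

# Linear predictor (log-odds): intercept plus coefficient times value
log_odds <- b[["intercept"]] + sum(b[names(obs)] * obs)

# Inverse logit turns log-odds into P(pollution_level = 1)
prob <- plogis(log_odds)
print(c(log_odds = log_odds, prob = prob))
```

With the fitted model object available, predict(PM25_logit_model, newdata = ..., type = "response") performs the same calculation.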