Chapter 5: Classification
Activity 8: Building a Logistic Regression Model with Additional Features
Copy the df_new data frame to df_copy for this activity:
df_copy <- df_new
Create square, cube, and square-root transformations of each of the three selected numeric features:
df_copy$MaxTemp2 <- df_copy$MaxTemp^2
df_copy$MaxTemp3 <- df_copy$MaxTemp^3
df_copy$MaxTemp_root <- sqrt(df_copy$MaxTemp)
df_copy$Rainfall2 <- df_copy$Rainfall^2
df_copy$Rainfall3 <- df_copy$Rainfall^3
df_copy$Rainfall_root <- sqrt(df_copy$Rainfall)
df_copy$Humidity3pm2 <- df_copy$Humidity3pm^2
df_copy$Humidity3pm3 <- df_copy$Humidity3pm^3
df_copy$Humidity3pm_root <- sqrt(df_copy$Humidity3pm)
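The nine assignments above follow one pattern, so they can also be generated in a loop. The following is a minimal sketch on a small made-up data frame; `toy` and the `add_power_features` helper are hypothetical names, not part of the activity. It assumes the selected columns are numeric and non-negative, since `sqrt()` returns `NaN` for negative inputs.

```r
# Hypothetical toy data standing in for df_new (values are made up)
toy <- data.frame(
  MaxTemp     = c(16, 25, 36),
  Rainfall    = c(0, 4, 9),
  Humidity3pm = c(49, 64, 81)
)

# Hypothetical helper: adds <name>2, <name>3 and <name>_root for each column
add_power_features <- function(df, cols) {
  for (col in cols) {
    df[[paste0(col, "2")]]     <- df[[col]]^2
    df[[paste0(col, "3")]]     <- df[[col]]^3
    df[[paste0(col, "_root")]] <- sqrt(df[[col]])
  }
  df
}

toy_copy <- add_power_features(toy, c("MaxTemp", "Rainfall", "Humidity3pm"))
print(names(toy_copy))
```

The column names produced this way match the ones used in the activity, so the rest of the steps would work unchanged on a data frame built with the helper.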
Split the df_copy dataset into training and test sets in a 70:30 ratio:
# Set the seed for reproducibility
set.seed(2019)
# Create a vector of row indexes for the training dataset (70%)
train_index <- sample(seq_len(nrow(df_copy)), floor(0.7 * nrow(df_copy)))
# Split the data into train and test
train_new <- df_copy[train_index, ]
test_new <- df_copy[-train_index, ]
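To see what the index-based split does, the sketch below repeats it on a hypothetical 10-row data frame (`toy`, `toy_train`, and `toy_test` are made-up names) and checks that the two parts are disjoint and together cover every row.

```r
# Hypothetical toy data frame with 10 rows (values are made up)
toy <- data.frame(id = 1:10, x = seq(0.5, 5, by = 0.5))

set.seed(2019)
train_index <- sample(seq_len(nrow(toy)), floor(0.7 * nrow(toy)))

toy_train <- toy[train_index, ]   # rows whose index was sampled
toy_test  <- toy[-train_index, ]  # negative indexing keeps all other rows

print(nrow(toy_train))                               # 7 of the 10 rows
print(nrow(toy_test))                                # the remaining 3 rows
print(length(intersect(toy_train$id, toy_test$id)))  # 0 shared rows
```

Because `sample()` draws without replacement by default, no row can land in both sets.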
Fit the logistic regression model with the new training data:
model <- glm(RainTomorrow ~ ., data = train_new, family = binomial(link = 'logit'))
Predict the responses on the training data with the fitted model and create a confusion matrix:
# confusionMatrix() comes from the caret package
library(caret)
print("Training data results -")
pred_train <- factor(ifelse(predict(model, newdata = train_new, type = "response") > 0.5, "Yes", "No"))
# Create the confusion matrix
train_metrics <- confusionMatrix(pred_train, train_new$RainTomorrow, positive = "Yes")
print(train_metrics)
The output is as follows:
"Training data results -"
Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  58330  8650
       Yes  3161  8906

               Accuracy : 0.8506
                 95% CI : (0.8481, 0.8531)
    No Information Rate : 0.7779
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.5132
 Mcnemar's Test P-Value : < 2.2e-16
            Sensitivity : 0.5073
            Specificity : 0.9486
         Pos Pred Value : 0.7380
         Neg Pred Value : 0.8709
             Prevalence : 0.2221
         Detection Rate : 0.1127
   Detection Prevalence : 0.1527
      Balanced Accuracy : 0.7279
       'Positive' Class : Yes
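The headline statistics in this output follow directly from the four cell counts. As a worked check, the sketch below recomputes them in base R from the training confusion matrix, with "Yes" as the positive class; the variable names are illustrative only.

```r
# Counts copied from the training confusion matrix ("Yes" is positive)
tp <- 8906   # predicted Yes, actually Yes
fn <- 8650   # predicted No,  actually Yes
fp <- 3161   # predicted Yes, actually No
tn <- 58330  # predicted No,  actually No

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of rainy days that were caught
specificity <- tn / (tn + fp)   # share of dry days correctly identified
precision   <- tp / (tp + fp)   # reported as Pos Pred Value
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision,
              balanced_accuracy = balanced), 4))
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value, and Balanced Accuracy lines of the output. The gap between sensitivity and specificity shows that the model misses roughly half of the rainy days while rarely raising a false alarm.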
Predict the responses on the test data with the fitted model and create a confusion matrix:
print("Test data results -")
pred_test <- factor(ifelse(predict(model, newdata = test_new, type = "response") > 0.5, "Yes", "No"))
# Create the confusion matrix
test_metrics <- confusionMatrix(pred_test, test_new$RainTomorrow, positive = "Yes")
print(test_metrics)
The output is as follows:
"Test data results -"
Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  25057  3640
       Yes  1358  3823

               Accuracy : 0.8525
                 95% CI : (0.8486, 0.8562)
    No Information Rate : 0.7797
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.5176
 Mcnemar's Test P-Value : < 2.2e-16
            Sensitivity : 0.5123
            Specificity : 0.9486
         Pos Pred Value : 0.7379
         Neg Pred Value : 0.8732
             Prevalence : 0.2203
         Detection Rate : 0.1128
   Detection Prevalence : 0.1529
      Balanced Accuracy : 0.7304
       'Positive' Class : Yes
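Both prediction steps above convert fitted probabilities to labels with a 0.5 cutoff. The sketch below shows that conversion on a few made-up probabilities (`probs` and `all_dry` are hypothetical, not model output). It also sets the factor levels explicitly, which is an addition to the activity code: without it, a batch in which every prediction falls on one side of the cutoff yields a one-level factor, which may not line up with the two-level reference passed to confusionMatrix.

```r
# Hypothetical predicted probabilities of rain (made-up values)
probs <- c(0.10, 0.45, 0.50, 0.51, 0.90)

# Strictly greater than 0.5 means "Yes", so exactly 0.50 maps to "No"
pred <- factor(ifelse(probs > 0.5, "Yes", "No"), levels = c("No", "Yes"))
print(table(pred))

# Without explicit levels, an all-"No" batch keeps only one level
all_dry <- c(0.05, 0.20, 0.30)
print(levels(factor(ifelse(all_dry > 0.5, "Yes", "No"))))
print(levels(factor(ifelse(all_dry > 0.5, "Yes", "No"),
                    levels = c("No", "Yes"))))
```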
Activity 9: Create a Decision Tree Model with Additional Control Parameters
Load the rpart library:
library(rpart)
Create the control object for the decision tree with the new values minsplit = 15 and cp = 0.001:
control <- rpart.control(minsplit = 15, cp = 0.001)
Fit the tree model on the training data and pass the control object to the rpart function:
tree_model <- rpart(RainTomorrow ~ ., data = train, control = control)
Draw the complexity parameter plot to see how the tree performs at different values of cp:
plotcp(tree_model)
The output is a plot of the cross-validated relative error against cp (figure not shown).
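The values that plotcp() draws are stored in the fitted object's cptable, so the cp with the lowest cross-validated error can also be read off programmatically. The sketch below is one possible follow-up, not a step of the activity: it uses the kyphosis example dataset that ships with rpart in place of the weather data, and the names `ctrl`, `toy_tree`, and `pruned_tree` are illustrative.

```r
library(rpart)

# kyphosis is an example dataset shipped with rpart; it stands in for train
set.seed(2019)  # the cross-validation folds inside rpart are random
ctrl <- rpart.control(minsplit = 15, cp = 0.001)
toy_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class", control = ctrl)

# cptable holds one row per candidate subtree; xerror is the
# cross-validated relative error that plotcp() draws
cp_table <- toy_tree$cptable
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# prune() cuts the tree back to the subtree for the chosen cp
pruned_tree <- prune(toy_tree, cp = best_cp)
print(cp_table)
print(nrow(pruned_tree$frame) <= nrow(toy_tree$frame))
```

Picking the minimum xerror is only one convention; a more conservative choice is the simplest tree within one standard error (the xstd column) of that minimum.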
Use the fitted model to make predictions on the training data and create the confusion matrix:
print("Training data results -")
pred_train <- predict(tree_model, newdata = train, type = "class")
confusionMatrix(pred_train, train$RainTomorrow, positive = "Yes")
The output is as follows:
"Training data results -"
Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  58494  9032
       Yes  2997  8524

               Accuracy : 0.8478
                 95% CI : (0.8453, 0.8503)
    No Information Rate : 0.7779
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.4979
 Mcnemar's Test P-Value : < 2.2e-16
            Sensitivity : 0.4855
            Specificity : 0.9513
         Pos Pred Value : 0.7399
         Neg Pred Value : 0.8662
             Prevalence : 0.2221
         Detection Rate : 0.1078
   Detection Prevalence : 0.1457
      Balanced Accuracy : 0.7184
       'Positive' Class : Yes
Use the fitted model to make predictions on the test data and create the confusion matrix:
print("Test data results -")
pred_test <- predict(tree_model, newdata = test, type = "class")
confusionMatrix(pred_test, test$RainTomorrow, positive = "Yes")
The output is as follows:
"Test data results -"
Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  25068  3926
       Yes  1347  3537

               Accuracy : 0.8444
                 95% CI : (0.8404, 0.8482)
    No Information Rate : 0.7797
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.4828
 Mcnemar's Test P-Value : < 2.2e-16
            Sensitivity : 0.4739
            Specificity : 0.9490
         Pos Pred Value : 0.7242
         Neg Pred Value : 0.8646
             Prevalence : 0.2203
         Detection Rate : 0.1044
   Detection Prevalence : 0.1442
      Balanced Accuracy : 0.7115
       'Positive' Class : Yes
Activity 10: Build a Random Forest Model with a Greater Number of Trees
First, import the randomForest library using the following command:
library(randomForest)
Build a random forest model with all the available independent features. Set the number of trees to 500 and cap each tree at 60 terminal nodes (maxnodes = 60):
rf_model <- randomForest(RainTomorrow ~ ., data = train, ntree = 500, importance = TRUE, maxnodes = 60)
Evaluate on training data:
print("Training data results -")
pred_train <- predict(rf_model, newdata = train, type = "class")
confusionMatrix(pred_train, train$RainTomorrow, positive = "Yes")
The output is as follows:
"Training data results -"
Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  59638 10169
       Yes  1853  7387

               Accuracy : 0.8479
                 95% CI : (0.8454, 0.8504)
    No Information Rate : 0.7779
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.4702
 Mcnemar's Test P-Value : < 2.2e-16
            Sensitivity : 0.42077
            Specificity : 0.96987
         Pos Pred Value : 0.79946
         Neg Pred Value : 0.85433
             Prevalence : 0.22210
         Detection Rate : 0.09345
   Detection Prevalence : 0.11689
      Balanced Accuracy : 0.69532
       'Positive' Class : Yes
Evaluate on test data:
print("Test data results -")
pred_test <- predict(rf_model, newdata = test, type = "class")
confusionMatrix(pred_test, test$RainTomorrow, positive = "Yes")
The output is as follows:
"Test data results -"
Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  25604  4398
       Yes   811  3065

               Accuracy : 0.8462
                 95% CI : (0.8424, 0.8501)
    No Information Rate : 0.7797
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.4592
 Mcnemar's Test P-Value : < 2.2e-16
            Sensitivity : 0.41069
            Specificity : 0.96930
         Pos Pred Value : 0.79076
         Neg Pred Value : 0.85341
             Prevalence : 0.22029
         Detection Rate : 0.09047
   Detection Prevalence : 0.11441
      Balanced Accuracy : 0.69000
       'Positive' Class : Yes
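To compare the three activities side by side, the test-set figures reported above can be collected into one data frame. This is only a convenience sketch: the numbers are copied by hand from the outputs in this section, and the `results` name is illustrative.

```r
# Test-set figures copied from the confusion matrix outputs in this section
results <- data.frame(
  model       = c("Logistic regression", "Decision tree", "Random forest"),
  accuracy    = c(0.8525, 0.8444, 0.8462),
  sensitivity = c(0.5123, 0.4739, 0.41069),
  specificity = c(0.9486, 0.9490, 0.96930)
)

# Sort so the model that catches the most rainy days comes first
results <- results[order(-results$sensitivity), ]
print(results, row.names = FALSE)
```

All three models sit within about one percentage point of each other on accuracy, so the ranking depends mostly on whether catching rainy days (sensitivity) or avoiding false alarms (specificity) matters more.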