Chapter 5: Linear and Logistic Regression Models
Activity 18: Implementing Linear Regression
Solution:
- Attach the packages:
# Attach packages
library(groupdata2)
library(cvms)
library(caret)
library(knitr)
- Set the random seed to 1:
# Set seed for reproducibility and easy comparison
set.seed(1)
- Load the cars dataset from caret:
# Load the cars dataset
data(cars)
- Partition the dataset into a training set (80%) and a validation set (20%):
# Partition the dataset
partitions <- partition(cars, p = 0.8)
train_set <- partitions[[1]]
valid_set <- partitions[[2]]
- Fit multiple linear regression models on the training set with the lm() function, predicting Price. Try different predictors. View and interpret the summary() of each fitted model. How do the interpretations change when you add or subtract predictors?
# Fit a couple of linear models and interpret them
# Model 1 - Predicting price by mileage
model_1 <- lm(Price ~ Mileage, data = train_set)
summary(model_1)
The summary of the model is as follows:
##
## Call:
## lm(formula = Price ~ Mileage, data = train_set)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -12977Â Â -7400Â Â -3453Â Â Â 6009Â Â 45540
##
## Coefficients:
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Estimate Std. Error t value Pr(>|t|)Â Â Â
## (Intercept)Â Â 2.496e+04Â Â 1.019e+03Â Â 24.488Â Â < 2e-16 ***
## Mileage     -1.736e-01  4.765e-02  -3.644 0.000291 ***
## ---
## Signif. codes:Â Â 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9762 on 641 degrees of freedom
## Multiple R-squared:  0.02029,    Adjusted R-squared:  0.01876
## F-statistic: 13.28 on 1 and 641 DF,  p-value: 0.0002906
# Model 2 - Predicting price by number of doors
model_2 <- lm(Price ~ Doors, data = train_set)
summary(model_2)
The summary of the model is as follows:
##
## Call:
## lm(formula = Price ~ Doors, data = train_set)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -12540Â Â -7179Â Â -2934Â Â Â 5814Â Â 45805
##
## Coefficients:
##Â Â Â Â Â Â Â Â Â Â Â Â Â Estimate Std. Error t value Pr(>|t|)Â Â Â
## (Intercept)Â Â 25682.2Â Â Â Â Â 1662.1Â Â 15.452Â Â Â <2e-16 ***
## Doors        -1176.6      457.5  -2.572   0.0103 *
## ---
## Signif. codes:Â Â 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9812 on 641 degrees of freedom
## Multiple R-squared:  0.01021,    Adjusted R-squared:  0.008671
## F-statistic: 6.615 on 1 and 641 DF,  p-value: 0.01034
# Model 3 - Predicting price by mileage and number of doors
model_3 <- lm(Price ~ Mileage + Doors, data = train_set)
summary(model_3)
The summary of the model is as follows:
##
## Call:
## lm(formula = Price ~ Mileage + Doors, data = train_set)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -12642Â Â -7503Â Â -3000Â Â Â 5595Â Â 43576
##
## Coefficients:
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Estimate Std. Error t value Pr(>|t|)Â Â Â
## (Intercept)Â Â 2.945e+04Â Â 1.926e+03Â Â 15.292Â Â < 2e-16 ***
## Mileage     -1.786e-01  4.744e-02  -3.764 0.000182 ***
## Doors       -1.242e+03  4.532e+02  -2.740 0.006308 **
## ---
## Signif. codes:Â Â 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9713 on 640 degrees of freedom
## Multiple R-squared:  0.03165,    Adjusted R-squared:  0.02863
## F-statistic: 10.46 on 2 and 640 DF,  p-value: 3.388e-05
- Create model formulas with combine_predictors(). Limit the number of possibilities by a) using only the first four predictors, b) limiting the number of fixed effects in the formulas to three by specifying max_fixed_effects = 3, c) limiting the biggest possible interaction to a two-way interaction by specifying max_interaction_size = 2, and d) limiting the number of times a predictor can be included in a formula by specifying max_effect_frequency = 1. These limitations will decrease the number of models to run, which you may or may not want in your own projects:
# Create list of model formulas with combine_predictors()
# Use only the 4 first predictors (to save time)
# Limit the number of fixed effects (predictors) to 3,
# Limit the biggest possible interaction to a 2-way interaction
# Limit the number of times a fixed effect is included to 1
model_formulas <- combine_predictors(
    dependent = "Price",
    fixed_effects = c("Mileage", "Cylinder",
                      "Doors", "Cruise"),
    max_fixed_effects = 3,
    max_interaction_size = 2,
    max_effect_frequency = 1)
# Output model formulas
model_formulas
The output is as follows:
##Â Â [1] "Price ~ Cruise"Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
##Â Â [2] "Price ~ Cylinder"Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
##Â Â [3] "Price ~ Doors"Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
##Â Â [4] "Price ~ Mileage"Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
##Â Â [5] "Price ~ Cruise * Cylinder"Â Â Â Â Â Â Â Â Â
##Â Â [6] "Price ~ Cruise * Doors"Â Â Â Â Â Â Â Â Â Â Â Â
##Â Â [7] "Price ~ Cruise * Mileage"Â Â Â Â Â Â Â Â Â Â
##Â Â [8] "Price ~ Cruise + Cylinder"Â Â Â Â Â Â Â Â Â
##Â Â [9] "Price ~ Cruise + Doors"Â Â Â Â Â Â Â Â Â Â Â Â
## [10] "Price ~ Cruise + Mileage"Â Â Â Â Â Â Â Â Â Â
## [11] "Price ~ Cylinder * Doors"Â Â Â Â Â Â Â Â Â Â
## [12] "Price ~ Cylinder * Mileage"Â Â Â Â Â Â Â Â
## [13] "Price ~ Cylinder + Doors"Â Â Â Â Â Â Â Â Â Â
## [14] "Price ~ Cylinder + Mileage"Â Â Â Â Â Â Â Â
## [15] "Price ~ Doors * Mileage"Â Â Â Â Â Â Â Â Â Â Â
## [16] "Price ~ Doors + Mileage"Â Â Â Â Â Â Â Â Â Â Â
## [17] "Price ~ Cruise * Cylinder + Doors"
## [18] "Price ~ Cruise * Cylinder + Mileage"
## [19] "Price ~ Cruise * Doors + Cylinder"
## [20] "Price ~ Cruise * Doors + Mileage"Â Â
## [21] "Price ~ Cruise * Mileage + Cylinder"
## [22] "Price ~ Cruise * Mileage + Doors"Â Â
## [23] "Price ~ Cruise + Cylinder * Doors"
## [24] "Price ~ Cruise + Cylinder * Mileage"
## [25] "Price ~ Cruise + Cylinder + Doors"
## [26] "Price ~ Cruise + Cylinder + Mileage"
## [27] "Price ~ Cruise + Doors * Mileage"Â Â
## [28] "Price ~ Cruise + Doors + Mileage"Â Â
## [29] "Price ~ Cylinder * Doors + Mileage"
## [30] "Price ~ Cylinder * Mileage + Doors"
## [31] "Price ~ Cylinder + Doors * Mileage"
## [32] "Price ~ Cylinder + Doors + Mileage"
- Create fivefold columns with four folds each in the training set, using fold() with k = 4 and num_fold_cols = 5. Feel free to choose a higher number of fold columns:
# Create 5 fold columns with 4 folds each in the training set
train_set <- fold(train_set, k = 4,
                  num_fold_cols = 5)
- Create the fold column names with paste0():
# Create list of fold column names
fold_cols <- paste0(".folds_", 1:5)
- Perform repeated cross-validation on your model formulas with cvms:
# Cross-validate the models with cvms
CV_results <- cross_validate(train_set,
                             models = model_formulas,
                             fold_cols = fold_cols,
                             family = "gaussian")
- Print the top 10 performing models according to RMSE. Select the best model:
# Select the best model by RMSE
# Order by RMSE
CV_results <- CV_results[order(CV_results$RMSE),]
# Select the 10 best performing models for printing
# (Feel free to view all the models)
CV_results_top10 <- head(CV_results, 10)
# Show metrics and model definition columns
# Use kable for a prettier output
kable(select_metrics(CV_results_top10), digits = 2)
The output is as follows:
Figure 5.29: Top 10 performing models using RMSE
- Fit the best model on the entire training set and evaluate it on the validation set. This can be done with the validate() function in cvms:
# Evaluate the best model on the validation set with validate()
V_results <- validate(
    train_data = train_set,
    test_data = valid_set,
    models = "Price ~ Cruise * Cylinder + Mileage",
    family = "gaussian")
- The output contains the results data frame and the trained model. Assign these to variable names:
valid_results <- V_results$Results
valid_model <- V_results$Models[[1]]
- Print the results:
# Print the results
kable(select_metrics(valid_results), digits = 2)
The output is as follows:
Figure 5.30: Results of the validated model
- View and interpret the summary of the best model:
# Print the model summary and interpret it
summary(valid_model)
The summary is as follows:
##
## Call:
## lm(formula = model_formula, data = train_set)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -10485Â Â -5495Â Â -1425Â Â Â 3494Â Â 34693
##
## Coefficients:
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Estimate Std. Error t value Pr(>|t|)Â Â Â
## (Intercept)Â Â Â Â Â Â 8993.2446Â Â 3429.9320Â Â Â 2.622Â Â 0.00895 **
## Cruise          -1311.6871  3585.6289  -0.366  0.71462  Â
## Cylinder         1809.5447   741.9185   2.439  0.01500 *
## Mileage            -0.1569     0.0367  -4.274 2.21e-05 ***
## Cruise:Cylinder  1690.0768   778.7838   2.170  0.03036 *
## ---
## Signif. codes:Â Â 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7503 on 638 degrees of freedom
## Multiple R-squared:  0.424,  Adjusted R-squared:  0.4203
## F-statistic: 117.4 on 4 and 638 DF,  p-value: < 2.2e-16
Activity 19: Classifying Room Types
Solution:
- Attach the groupdata2, cvms, caret, randomForest, rPref, and doParallel packages:
library(groupdata2)
library(cvms)
library(caret)
library(randomForest)
library(rPref)
library(doParallel)
- Set the random seed to 3:
set.seed(3)
- Load the amsterdam.listings dataset from https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/amsterdam.listings.csv:
# Load the amsterdam.listings dataset
full_data <- read.csv("amsterdam.listings.csv")
- Convert the id and neighbourhood columns to factors:
full_data$id <- factor(full_data$id)
full_data$neighbourhood <- factor(full_data$neighbourhood)
- Summarize the dataset:
summary(full_data)
The summary of data is as follows:
##        id                        neighbourhood            room_type  Â
##Â Â 2818Â Â Â :Â Â Â Â 1Â Â Â De Baarsjes - Oud-West :3052Â Â Â Entire home/apt:13724
##  20168  :    1   De Pijp - Rivierenbuurt:2166   Private room   : 3594
##  25428  :    1   Centrum-West           :2019                        Â
##  27886  :    1   Centrum-Oost           :1498                        Â
##  28658  :    1   Westerpark             :1315                        Â
##  28871  :    1   Zuid                   :1187                        Â
##Â Â (Other):17312Â Â Â (Other)Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â :6081Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
##  availability_365   log_price     log_minimum_nights log_number_of_reviews
##  Min.   :  0.00   Min.   :2.079   Min.   :0.0000     Min.   :0.000      Â
##Â Â 1st Qu.:Â Â 0.00Â Â Â 1st Qu.:4.595Â Â Â 1st Qu.:0.6931Â Â Â Â Â 1st Qu.:1.386Â Â Â Â Â Â Â
##Â Â Median :Â Â 0.00Â Â Â Median :4.852Â Â Â Median :0.6931Â Â Â Â Â Median :2.398Â Â Â Â Â Â Â
##  Mean   : 48.33   Mean   :4.883   Mean   :0.8867     Mean   :2.370      Â
##Â Â 3rd Qu.: 49.00Â Â Â 3rd Qu.:5.165Â Â Â 3rd Qu.:1.0986Â Â Â Â Â 3rd Qu.:3.219Â Â Â Â Â Â Â
##  Max.   :365.00   Max.   :7.237   Max.   :4.7875     Max.   :6.196      Â
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
##Â Â log_reviews_per_month
##  Min.   :-4.60517   Â
##Â Â 1st Qu.:-1.42712Â Â Â Â
##Â Â Median :-0.61619Â Â Â Â
##  Mean   :-0.67858   Â
##Â Â 3rd Qu.: 0.07696Â Â Â Â
##  Max.   : 2.48907   Â
##
- Partition the dataset into a training set (80%) and a validation set (20%). Balance the partitions by room_type:
partitions <- partition(full_data, p = 0.8,
                        cat_col = "room_type")
train_set <- partitions[[1]]
valid_set <- partitions[[2]]
- Prepare for running the baseline evaluations and the cross-validations in parallel by registering the number of cores for doParallel:
# Register four CPU cores
registerDoParallel(4)
- Create the baseline evaluation for the task on the validation set with the baseline() function from cvms. Run 100 evaluations in parallel. Specify the dependent column as room_type. Note that the default positive class is Private room:
room_type_baselines <- baseline(test_data = valid_set,
                                dependent_col = "room_type",
                                n = 100,
                                family = "binomial",
                                parallel = TRUE)
# Inspect summarized metrics
room_type_baselines$summarized_metrics
The output is as follows:
## # A tibble: 10 x 15
##    Measure 'Balanced Accur…       F1 Sensitivity Specificity
##Â Â Â Â <chr>Â Â Â Â Â Â Â Â Â Â Â Â Â Â <dbl>Â Â Â Â <dbl>Â Â Â Â Â Â Â <dbl>Â Â Â Â Â Â Â <dbl>
##  1 Mean              0.502   0.295        0.503      0.500
##  2 Median            0.502   0.294        0.502      0.500
##Â Â 3 SDÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â 0.0101Â Â 0.00954Â Â Â Â Â Â 0.0184Â Â Â Â Â 0.00924
##Â Â 4 IQRÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â 0.0154Â Â 0.0131Â Â Â Â Â Â Â 0.0226Â Â Â Â Â 0.0122
##  5 Max               0.525   0.317        0.551      0.522
##  6 Min               0.480   0.275        0.463      0.480
##  7 NAs               0       0            0          0    Â
##  8 INFs              0       0            0          0    Â
##Â Â 9 All_0Â Â Â Â Â Â Â Â Â Â Â Â Â 0.5Â Â Â Â NAÂ Â Â Â Â Â Â Â Â Â Â Â 0Â Â Â Â Â Â Â Â Â Â 1Â Â Â Â Â
## 10 All_1Â Â Â Â Â Â Â Â Â Â Â Â Â 0.5Â Â Â Â Â 0.344Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â Â Â 0Â Â Â Â Â
## # … with 10 more variables: 'Pos Pred Value' <dbl>, 'Neg Pred
## #Â Â Â Value' <dbl>, AUC <dbl>, 'Lower CI' <dbl>, 'Upper CI' <dbl>,
## #Â Â Â Kappa <dbl>, MCC <dbl>, 'Detection Rate' <dbl>, 'Detection
## #Â Â Â Prevalence' <dbl>, Prevalence <dbl>
- Fit multiple logistic regression models on the training set with the glm() function, predicting room_type. Try different predictors. View the summary() of each fitted model and try to interpret the estimated coefficients. Observe how the interpretations change when you add or subtract predictors:
logit_model_1 <- glm("room_type ~ log_number_of_reviews",
                     data = train_set, family="binomial")
summary(logit_model_1)
The summary of the model is as follows:
## ## Call:
## glm(formula = "room_type ~ log_number_of_reviews", family = "binomial",
##Â Â Â Â Â data = train_set)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.3030Â Â -0.7171Â Â -0.5668Â Â -0.3708Â Â Â 2.3288
##
## Coefficients:
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Estimate Std. Error z value Pr(>|z|)Â Â Â
## (Intercept)Â Â Â Â Â Â Â Â Â Â Â -2.64285Â Â Â Â 0.05374Â Â -49.18Â Â Â <2e-16 ***
## log_number_of_reviews  0.49976    0.01745   28.64   <2e-16 ***
## ---
## Signif. codes:Â Â 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##Â Â Â Â Â Null deviance: 14149Â Â on 13853Â Â degrees of freedom
## Residual deviance: 13249Â Â on 13852Â Â degrees of freedom
## AIC: 13253
##
## Number of Fisher Scoring iterations: 4
- Add availability_365 as a predictor:
logit_model_2 <- glm(
    "room_type ~ availability_365 + log_number_of_reviews",
    data = train_set, family = "binomial")
summary(logit_model_2)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ availability_365 + log_number_of_reviews",
##Â Â Â Â Â family = "binomial", data = train_set)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.5277Â Â -0.6735Â Â -0.5535Â Â -0.3688Â Â Â 2.3365
##
## Coefficients:
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Estimate Std. Error z value Pr(>|z|)Â Â Â
## (Intercept)Â Â Â Â Â Â Â Â Â Â Â -2.6622105Â Â 0.0533345Â Â -49.91Â Â Â <2e-16 ***
## availability_365Â Â Â Â Â Â Â 0.0039866Â Â 0.0002196Â Â Â 18.16Â Â Â <2e-16 ***
## log_number_of_reviews  0.4172148  0.0178015   23.44   <2e-16 ***
## ---
## Signif. codes:Â Â 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##Â Â Â Â Â Null deviance: 14149Â Â on 13853Â Â degrees of freedom
## Residual deviance: 12935Â Â on 13851Â Â degrees of freedom
## AIC: 12941
##
## Number of Fisher Scoring iterations: 4
- Add log_price as a predictor:
logit_model_3 <- glm(
    "room_type ~ availability_365 + log_number_of_reviews + log_price",
    data = train_set, family = "binomial")
summary(logit_model_3)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ availability_365 + log_number_of_reviews + log_price",
##Â Â Â Â Â family = "binomial", data = train_set)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.7678Â Â -0.5805Â Â -0.3395Â Â -0.1208Â Â Â 3.9864
##
## Coefficients:
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Estimate Std. Error z value Pr(>|z|)Â Â Â
## (Intercept)Â Â Â Â Â Â Â Â Â Â Â 12.6730455Â Â 0.3419771Â Â Â 37.06Â Â Â <2e-16 ***
## availability_365Â Â Â Â Â Â Â 0.0081215Â Â 0.0002865Â Â Â 28.34Â Â Â <2e-16 ***
## log_number_of_reviews  0.3613055  0.0199845   18.08   <2e-16 ***
## log_price             -3.2539506  0.0745417  -43.65   <2e-16 ***
## ---
## Signif. codes:Â Â 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##Â Â Â Â Â Null deviance: 14149.2Â Â on 13853Â Â degrees of freedom
## Residual deviance:Â Â 9984.7Â Â on 13850Â Â degrees of freedom
## AIC: 9992.7
##
## Number of Fisher Scoring iterations: 6
- Add log_minimum_nights as a predictor:
logit_model_4 <- glm(
    "room_type ~ availability_365 + log_number_of_reviews + log_price + log_minimum_nights",
     data = train_set, family = "binomial")
summary(logit_model_4)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ availability_365 + log_number_of_reviews + log_price + log_minimum_nights",
##Â Â Â Â Â family = "binomial", data = train_set)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.6268Â Â -0.5520Â Â -0.3142Â Â -0.1055Â Â Â 4.6062
##
## Coefficients:
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Estimate Std. Error z value Pr(>|z|)Â Â Â
## (Intercept)Â Â Â Â Â Â Â Â Â Â Â 13.470868Â Â Â 0.354695Â Â Â 37.98Â Â Â <2e-16 ***
## availability_365Â Â Â Â Â Â Â 0.008310Â Â Â 0.000295Â Â Â 28.17Â Â Â <2e-16 ***
## log_number_of_reviews  0.360133   0.020422   17.64   <2e-16 ***
## log_price             -3.252957   0.076343  -42.61   <2e-16 ***
## log_minimum_nights    -1.007354   0.051131  -19.70   <2e-16 ***
## ---
## Signif. codes:Â Â 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##Â Â Â Â Â Null deviance: 14149.2Â Â on 13853Â Â degrees of freedom
## Residual deviance:Â Â 9543.9Â Â on 13849Â Â degrees of freedom
## AIC: 9553.9
##
## Number of Fisher Scoring iterations: 6
- Replace log_number_of_reviews with log_reviews_per_month:
logit_model_5 <- glm(
    "room_type ~ availability_365 + log_reviews_per_month + log_price + log_minimum_nights",
    data = train_set, family = "binomial")
summary(logit_model_5)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ availability_365 + log_reviews_per_month + log_price + log_minimum_nights",
##Â Â Â Â Â family = "binomial", data = train_set)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.7351Â Â -0.5229Â Â -0.2934Â Â -0.0968Â Â Â 4.7252
##
## Coefficients:
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Estimate Std. Error z value Pr(>|z|)Â Â Â
## (Intercept)Â Â Â Â Â Â Â Â Â Â Â 14.7495851Â Â 0.3652303Â Â Â 40.38Â Â Â <2e-16 ***
## availability_365Â Â Â Â Â Â Â 0.0074308Â Â 0.0003019Â Â Â 24.61Â Â Â <2e-16 ***
## log_reviews_per_month  0.6364218  0.0246423   25.83   <2e-16 ***
## log_price             -3.2850702  0.0781567  -42.03   <2e-16 ***
## log_minimum_nights    -0.8504701  0.0526379  -16.16   <2e-16 ***
## ---
## Signif. codes:Â Â 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##Â Â Â Â Â Null deviance: 14149Â Â on 13853Â Â degrees of freedom
## Residual deviance:Â Â 9103Â Â on 13849Â Â degrees of freedom
## AIC: 9113
##
## Number of Fisher Scoring iterations: 6
- Create model formulas with combine_predictors(). To save time, limit the interaction size to 2 by specifying max_interaction_size = 2, and limit the number of times an effect can be included in a formula to 1 by specifying max_effect_frequency = 1:
model_formulas <- combine_predictors(
    dependent = "room_type",
    fixed_effects = c("log_minimum_nights",
                      "log_number_of_reviews",
                      "log_price",
                      "availability_365",
                      "log_reviews_per_month"),
    max_interaction_size = 2,
    max_effect_frequency = 1)
head(model_formulas, 10)
The output is as follows:
##Â Â [1] "room_type ~ availability_365"Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
##Â Â [2] "room_type ~ log_minimum_nights"Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
##Â Â [3] "room_type ~ log_number_of_reviews"Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
##Â Â [4] "room_type ~ log_price"Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
##Â Â [5] "room_type ~ log_reviews_per_month"Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
##Â Â [6] "room_type ~ availability_365 * log_minimum_nights"Â Â
##Â Â [7] "room_type ~ availability_365 * log_number_of_reviews"
##Â Â [8] "room_type ~ availability_365 * log_price"Â Â Â Â Â Â Â Â Â Â Â
##Â Â [9] "room_type ~ availability_365 * log_reviews_per_month"
## [10] "room_type ~ availability_365 + log_minimum_nights"
- Create fivefold columns with five folds each in the training set, using fold() with k = 5 and num_fold_cols = 5. Balance the folds by room_type. Feel free to choose a higher number of fold columns:
train_set <- fold(train_set, k = 5,
                  num_fold_cols = 5,
                  cat_col = "room_type")
- Perform cross-validation (not repeated) on your model formulas with cvms. Specify fold_cols = ".folds_1". Order the results by F1 and show the best 10 models:
initial_cv_results <- cross_validate(
    train_set,
    models = model_formulas,
    fold_cols = ".folds_1",
    family = "binomial",
    parallel = TRUE)
initial_cv_results <- initial_cv_results[
    order(initial_cv_results$F1, decreasing = TRUE),]
head(initial_cv_results, 10)
The output is as follows:
## # A tibble: 10 x 26
##    'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â <dbl> <dbl>Â Â Â Â Â Â Â <dbl>Â Â Â Â Â Â Â <dbl>Â Â Â Â Â Â Â Â Â Â Â Â <dbl>
##Â Â 1Â Â Â Â Â Â Â Â Â Â Â Â 0.764 0.662Â Â Â Â Â Â Â 0.567Â Â Â Â Â Â Â 0.962Â Â Â Â Â Â Â Â Â Â Â Â 0.797
##Â Â 2Â Â Â Â Â Â Â Â Â Â Â Â 0.764 0.662Â Â Â Â Â Â Â 0.565Â Â Â Â Â Â Â 0.963Â Â Â Â Â Â Â Â Â Â Â Â 0.798
##Â Â 3Â Â Â Â Â Â Â Â Â Â Â Â 0.763 0.662Â Â Â Â Â Â Â 0.563Â Â Â Â Â Â Â 0.963Â Â Â Â Â Â Â Â Â Â Â Â 0.801
##Â Â 4Â Â Â Â Â Â Â Â Â Â Â Â 0.763 0.661Â Â Â Â Â Â Â 0.563Â Â Â Â Â Â Â 0.963Â Â Â Â Â Â Â Â Â Â Â Â 0.800
##Â Â 5Â Â Â Â Â Â Â Â Â Â Â Â 0.761 0.654Â Â Â Â Â Â Â 0.564Â Â Â Â Â Â Â 0.958Â Â Â Â Â Â Â Â Â Â Â Â 0.778
##Â Â 6Â Â Â Â Â Â Â Â Â Â Â Â 0.757 0.653Â Â Â Â Â Â Â 0.549Â Â Â Â Â Â Â 0.966Â Â Â Â Â Â Â Â Â Â Â Â 0.807
##Â Â 7Â Â Â Â Â Â Â Â Â Â Â Â 0.757 0.652Â Â Â Â Â Â Â 0.549Â Â Â Â Â Â Â 0.965Â Â Â Â Â Â Â Â Â Â Â Â 0.804
##Â Â 8Â Â Â Â Â Â Â Â Â Â Â Â 0.758 0.649Â Â Â Â Â Â Â 0.560Â Â Â Â Â Â Â 0.957Â Â Â Â Â Â Â Â Â Â Â Â 0.774
##Â Â 9Â Â Â Â Â Â Â Â Â Â Â Â 0.756 0.649Â Â Â Â Â Â Â 0.550Â Â Â Â Â Â Â 0.962Â Â Â Â Â Â Â Â Â Â Â Â 0.792
## 10Â Â Â Â Â Â Â Â Â Â Â Â 0.758 0.649Â Â Â Â Â Â Â 0.559Â Â Â Â Â Â Â 0.957Â Â Â Â Â Â Â Â Â Â Â Â 0.775
## # … with 21 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower
## #Â Â Â CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection
## #Â Â Â Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,
## #Â Â Â Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,
## #Â Â Â Coefficients <list>, Folds <int>, 'Fold Columns' <int>, 'Convergence
## #Â Â Â Warnings' <dbl>, 'Singular Fit Messages' <int>, Family <chr>,
## #Â Â Â Link <chr>, Dependent <chr>, Fixed <chr>
- Perform repeated cross-validation on the 10-20 best model formulas (by F1) with cvms:
# Reconstruct the best 20 models' formulas
reconstructed_formulas <- reconstruct_formulas(
    initial_cv_results,
    topn = 20)
# Create fold_cols
fold_cols <- paste0(".folds_", 1:5)
# Perform repeated cross-validation
repeated_cv_results <- cross_validate(
    train_set,
    models = model_formulas,
    fold_cols = fold_cols,
    family = "binomial",
    parallel = TRUE)
# Order by F1
repeated_cv_results <- repeated_cv_results[
    order(repeated_cv_results$F1, decreasing = TRUE),]
# Inspect the 10 best modelsresults
head(repeated_cv_results)
The output is as follows:
## # A tibble: 6 x 27
##   'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â <dbl> <dbl>Â Â Â Â Â Â Â <dbl>Â Â Â Â Â Â Â <dbl>Â Â Â Â Â Â Â Â Â Â Â Â <dbl>
## 1Â Â Â Â Â Â Â Â Â Â Â Â 0.764 0.662Â Â Â Â Â Â Â 0.566Â Â Â Â Â Â Â 0.962Â Â Â Â Â Â Â Â Â Â Â Â 0.796
## 2Â Â Â Â Â Â Â Â Â Â Â Â 0.763 0.661Â Â Â Â Â Â Â 0.564Â Â Â Â Â Â Â 0.963Â Â Â Â Â Â Â Â Â Â Â Â 0.798
## 3Â Â Â Â Â Â Â Â Â Â Â Â 0.763 0.660Â Â Â Â Â Â Â 0.562Â Â Â Â Â Â Â 0.963Â Â Â Â Â Â Â Â Â Â Â Â 0.800
## 4Â Â Â Â Â Â Â Â Â Â Â Â 0.762 0.659Â Â Â Â Â Â Â 0.561Â Â Â Â Â Â Â 0.963Â Â Â Â Â Â Â Â Â Â Â Â 0.800
## 5Â Â Â Â Â Â Â Â Â Â Â Â 0.761 0.654Â Â Â Â Â Â Â 0.563Â Â Â Â Â Â Â 0.958Â Â Â Â Â Â Â Â Â Â Â Â 0.780
## 6Â Â Â Â Â Â Â Â Â Â Â Â 0.758 0.654Â Â Â Â Â Â Â 0.551Â Â Â Â Â Â Â 0.965Â Â Â Â Â Â Â Â Â Â Â Â 0.805
## # … with 22 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower
## #Â Â Â CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection
## #Â Â Â Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,
## #Â Â Â Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,
## #Â Â Â Coefficients <list>, Results <list>, Folds <int>, 'Fold
## #Â Â Â Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit
## #Â Â Â Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,
## #Â Â Â Fixed <chr>
- Find the Pareto front based on the F1 and balanced accuracy scores. Use psel() from the rPref package and specify pref = high("F1") * high("`Balanced Accuracy`"). Note the ticks around `Balanced Accuracy`:
# Find the Pareto front
front <- psel(repeated_cv_results,
              pref = high("F1") * high("`Balanced Accuracy`"))
# Remove rows with NA in F1 or Balanced Accuracy
front <- front[complete.cases(front[1:2]), ]
# Inspect front
front
The output is as follows:
## # A tibble: 1 x 27
##   'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â <dbl> <dbl>Â Â Â Â Â Â Â <dbl>Â Â Â Â Â Â Â <dbl>Â Â Â Â Â Â Â Â Â Â Â Â <dbl>
## 1Â Â Â Â Â Â Â Â Â Â Â Â 0.764 0.662Â Â Â Â Â Â Â 0.566Â Â Â Â Â Â Â 0.962Â Â Â Â Â Â Â Â Â Â Â Â 0.796
## # … with 22 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower
## #Â Â Â CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection
## #Â Â Â Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,
## #Â Â Â Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,
## #Â Â Â Coefficients <list>, Results <list>, Folds <int>, 'Fold
## #Â Â Â Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit
## #Â Â Â Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,
## #Â Â Â Fixed <chr>
The best model according to F1 is also the best model by balanced accuracy, so the Pareto front only contains one model.
- Plot the Pareto front with the ggplot2 code from Exercise 61, Plotting the Pareto Front. Note that you may need to add ticks around 'Balanced Accuracy' when specifying x or y in aes() in the ggplot call:
# Create ggplot object
# with precision on the x-axis and precision on the y-axis
ggplot(repeated_cv_results, aes(x = F1, y = 'Balanced Accuracy')) +
  # Add the models as points
  geom_point(shape = 1, size = 0.5) +
  # Add the nondominated models as larger points
  geom_point(data = front, size = 3) +
  # Add a line to visualize the pareto front
  geom_step(data = front, direction = "vh") +
  # Add the light theme
  theme_light()
The output is similar to the following:
Figure 5.31: Pareto front with the F1 and balanced accuracy scores
- Use validate() to train the nondominated models on the training set and evaluate them on the validation set:
# Reconstruct the formulas for the front models
reconstructed_formulas <- reconstruct_formulas(front)
# Validate the models in the Pareto front
v_results_list <- validate(train_data = train_set,
                           test_data = valid_set,
                           models = reconstructed_formulas,
                           family = "binomial")
# Assign the results and model(s) to variable names
v_results <- v_results_list$Results
v_model <- v_results_list$Models[[1]]
v_results
The output is as follows:
## # A tibble: 1 x 24
##   'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'
##Â Â Â Â Â Â Â Â Â Â Â Â Â Â <dbl> <dbl>Â Â Â Â Â Â Â <dbl>Â Â Â Â Â Â Â <dbl>Â Â Â Â Â Â Â Â Â Â Â Â <dbl>
## 1Â Â Â Â Â Â Â Â Â Â Â Â 0.758 0.652Â Â Â Â Â Â Â 0.554Â Â Â Â Â Â Â 0.962Â Â Â Â Â Â Â Â Â Â Â Â 0.794
## # … with 19 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower
## #Â Â Â CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection
## #Â Â Â Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,
## #Â Â Â Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,
## #Â Â Â Coefficients <list>, 'Convergence Warnings' <dbl>, 'Singular Fit
## #Â Â Â Messages' <dbl>, Family <chr>, Link <chr>, Dependent <chr>,
## #Â Â Â Fixed <chr>
These results are a lot better than the baseline on both F1 and balanced accuracy.
- View the summaries of the nondominated model(s):
summary(v_model)
The summary of the model is as follows:
##
## Call:
## glm(formula = model_formula, family = binomial(link = link),
##Â Â Â Â Â data = train_set)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.8323Â Â -0.4836Â Â -0.2724Â Â -0.0919Â Â Â 3.9091
##
## Coefficients:
##                           Estimate Std. Error z value  Pr(>|z|)
## (Intercept)Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â 15.3685268Â Â 0.4511978Â Â 34.062Â Â < 2e-16 ***
## availability_365Â Â Â Â Â Â Â Â Â Â -0.0140209Â Â 0.0030623Â Â -4.579 4.68e-06 ***
## log_price                 -3.4441520  0.0956189 -36.020  < 2e-16 ***
## log_minimum_nights        -0.7163252  0.0535452 -13.378  < 2e-16 ***
## log_number_of_reviews      -0.0823821  0.0282115  -2.920   0.0035 **
## log_reviews_per_month      0.0733808  0.0381629   1.923   0.0545 .
## availability_365:log_price 0.0042772  0.0006207   6.891 5.53e-12 ***
## log_n_o_reviews:log_r_p_month 0.3730603Â 0.0158122Â 23.593Â < 2e-16 ***
## ---
## Signif. codes:Â Â 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##Â Â Â Â Â Null deviance: 14149.2Â Â on 13853Â Â degrees of freedom
## Residual deviance:Â Â 8476.7Â Â on 13846Â Â degrees of freedom
## AIC: 8492.7
##
## Number of Fisher Scoring iterations: 6
Note, that we have shortened log_number_of_reviews and log_reviews_per_month in the interaction term, log_n_o_reviews:log_r_p_month. The coefficients for the two interaction terms are both statistically significant. The interaction term log_n_o_reviews:log_r_p_month tells us that when log_number_of_reviews increases by one unit, the coefficient for log_reviews_per_month increases by 0.37, and vice versa. We might question the meaningfulness of including both these predictors in the model, as they have some of the same information. If we also had the number of months the listing had been listed, we should be able to recreate log_number_of_reviews from log_reviews_per_month and the number of months, which would probably be easier to interpret as well.
The second interaction term, availability_365:log_price, tells us that when availability_365 increases by a single unit, the coefficient for log_price increases by 0.004, and vice versa. The coefficient estimate for log_price is -3.44, meaning that when availability_365 is low, a higher log_price decreases the probability that the listing is a private room. This fits with the intuition that a private room is usually cheaper than an entire home/apartment.
The coefficient for log_minimum_nights tells us that when there is a higher minimum requirement for the number of nights when we book the listing, there's a lower probability that the listing is a private room.