Statistical/machine learning models
The previous section introduced a host of problems through real datasets, and we will now discuss some standard model variants that are useful for dealing with such problems. First, we set up the required mathematical framework.
Suppose that we have n independent pairs of observations, $(Y_i, x_i), i = 1, \ldots, n$, where $Y_i$ denotes the random variable of interest, also known as the dependent variable, regressand, endogenous variable, and so on, and $x_i$ is the associated vector of explanatory variables, or independent/exogenous variables. The explanatory vector will consist of k elements, that is, $x_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$. The realized data is of the form $(y_i, x_i), i = 1, \ldots, n$, where $y_i$ is the realized value (data) of the random variable $Y_i$. A convention adopted throughout the book is that $x_{i1} = 1$, and this will take care of the intercept term. We assume that the observations are from the true distribution F, which is not completely known. The general regression model, covering the classification model as well as the regression model, is specified by:

$$Y_i = f(x_i, \beta) + \epsilon_i, \quad i = 1, \ldots, n$$
Here, the function f is an unknown function and $\beta$ is the regression parameter, which captures the influence of $x_i$ on $Y_i$. The error $\epsilon_i$ is the associated unobservable error term. Diverse methods can be applied to model the relationship between the Ys and the xs. The statistical regression model focuses on the complete specification of the error distribution $\epsilon$, and in general the functional form is linear, as in $f(x_i, \beta) = g^{-1}(x_i'\beta)$. The function g is the link function in the class of generalized linear models. Nonparametric and semiparametric regression models are more flexible, as we don't place a restriction on the error's probability distribution. This flexibility comes at a price though: we need a much higher number of observations to make valid inferences, although that number is unspecified and often subjective.
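As a minimal sketch of how the link function g enters a generalized linear model in R (simulated data, not tied to any dataset in this chapter), the same linear predictor $x'\beta$ can be mapped to the mean response through different links:

# A sketch with simulated data: the link function is chosen through the family argument.
set.seed(123)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit_logit  <- glm(y ~ x, family = binomial(link = "logit"))    # g = logit
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))   # g = probit
coef(fit_logit)
coef(fit_probit)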
The machine learning paradigm includes some black box methods, and we have a healthy overlap between this paradigm and non- and semi-parametric models. The reader is also cautioned that black box does not mean unscientific in any sense. The methods have a firm mathematical foundation and are reproducible every time. Next, we quickly review some of the most important statistical and machine learning models, and illustrate them through the datasets discussed earlier.
Logistic regression model
The logistic regression model is a binary classification model, and it belongs to the class of generalized linear models, since the binomial distribution is a member of the exponential family. Now, let $Y_i$ denote the binary label:

$$Y_i = \begin{cases} 1, & \text{if the event of interest occurs,} \\ 0, & \text{otherwise.} \end{cases}$$
Using the information contained in the explanatory vector $x_i$, we try to build a model that will help in this task. The logistic regression model is the following:

$$\pi_i = P(Y_i = 1 \mid x_i) = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}$$
Here, $\beta$ is the vector of regression coefficients. Note that the logit function, $\log\{\pi_i/(1-\pi_i)\} = x_i'\beta$, is linear in the regression coefficients, and hence the name logistic regression model. A logistic regression model can be equivalently written as follows:

$$Y_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)} + \epsilon_i$$
Here, $\epsilon_i$ is the binary error term that follows a Bernoulli distribution. For more information, refer to Chapter 17 of Tattar, et al. (2016). The estimation of the parameters of the logistic regression model requires the iteratively reweighted least squares (IRLS) algorithm, and we will use the glm R function to get this task done. We will use the Hypothyroid dataset in this section. In the previous section, the training and test datasets and formulas were already created, and we will carry on from that point.
Logistic regression for hypothyroid classification
For the hypothyroid dataset, we had HT2_Train as the training dataset. The test dataset is split into the covariate matrix HT2_TestX and the outputs of the test dataset in HT2_TestY, while the formula for the logistic regression model is available in HT2_Formula. First, the logistic regression model is fitted to the training dataset using the glm function and the fitted model is christened LR_fit, and then we inspect it for model fit summaries using summary(LR_fit). The fitted model is then applied to the covariate data in the test part using the predict function to create LR_Predict. The predicted probabilities are then labeled in LR_Predict_Bin, and these labels are compared with the actual values in testY_numeric to obtain the overall accuracy:
> ntr <- nrow(HT2_Train) # Training size
> nte <- nrow(HT2_TestX) # Test size
> p <- ncol(HT2_TestX)
> testY_numeric <- as.numeric(HT2_TestY)
> LR_fit <- glm(HT2_Formula,data=HT2_Train,family = binomial())
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(LR_fit)

Call:
glm(formula = HT2_Formula, family = binomial(), data = HT2_Train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6390   0.0076   0.0409   0.1068   3.5127

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.302025   2.365804  -3.509 0.000449 ***
Age         -0.024422   0.012145  -2.011 0.044334 *
GenderMALE  -0.195656   0.464353  -0.421 0.673498
TSH         -0.008457   0.007530  -1.123 0.261384
T3           0.480986   0.347525   1.384 0.166348
TT4         -0.089122   0.028401  -3.138 0.001701 **
T4U          3.932253   1.801588   2.183 0.029061 *
FTI          0.197196   0.035123   5.614 1.97e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 609.00  on 1363  degrees of freedom
Residual deviance: 181.42  on 1356  degrees of freedom
AIC: 197.42

Number of Fisher Scoring iterations: 9

> LR_Predict <- predict(LR_fit,newdata=HT2_TestX,type="response")
> LR_Predict_Bin <- ifelse(LR_Predict>0.5,2,1)
> LR_Accuracy <- sum(LR_Predict_Bin==testY_numeric)/nte
> LR_Accuracy
[1] 0.9732704
It can be seen from the summary of the fitted GLM (the output following the line summary(LR_fit)) that we have four significant variables: Age, TT4, T4U, and FTI. Using the predict function, we apply the fitted model to the unseen test cases in HT2_TestX, compare the predictions with the actuals, and find the accuracy to be 97.33%. Consequently, logistic regression is easily deployed in the R software.
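As a quick check that ties the output back to the model equation (a sketch, assuming LR_fit, HT2_TestX, and LR_Predict from the listing above are still in the workspace), the predicted probabilities from type="response" are simply the inverse logit applied to the linear predictor $x'\beta$:

# The linear predictor x'beta, then the inverse logit (plogis) recovers the probabilities.
LR_LinPred <- predict(LR_fit, newdata = HT2_TestX, type = "link")
max(abs(plogis(LR_LinPred) - LR_Predict))   # effectively zero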
Neural networks
Logistic regression might appear restricted, as it allows only a linear impact of the covariates through the link function. The linearity assumption might not hold, and in most practical cases, we don't have enough information to specify the functional form of the nonlinear relationship; all we know is that there is most likely an unknown nonlinear relationship. Neural networks are a nonlinear generalization of logistic regression, and they involve two important components: hidden neurons and the learning rate. We will first review the structure of a neural network.
In a neural network, the input variables are considered the first layer of neurons and the output the final layer of neurons. The structure of a neural network model can be visualized using the R package NeuralNetTools. Suppose that we have three input variables and two hidden layers, each containing two hidden neurons. Here, we have a neural network with four layers. The next code segment gives a visualization of a neural network's structure with three input variables, two hidden neurons in each of two hidden layers, and one output variable:
> library(NeuralNetTools)
> plotnet(rep(0,17),struct=c(3,2,2,1))
> title("A Neural Network with Two Hidden Layers")
We find the R package NeuralNetTools very useful in visualizing the structure of a neural network. Neural networks built using the core R package nnet can also be visualized using the NeuralNetTools::plotnet function. The plotnet function sets up a neural network whose structure consists of three neurons in the first layer, two neurons in each of the second and third layers, and one in the final output layer, through the struct option. The weights along the arcs are set at zero in rep(0,17):
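A quick piece of arithmetic shows why exactly 17 weights are needed for struct=c(3,2,2,1): each neuron in a layer receives one weight per neuron of the previous layer plus one bias weight.

# Counting the weights for struct=c(3,2,2,1):
(3 + 1) * 2 +   # input layer (3 neurons + bias) feeding 2 neurons of the first hidden layer
(2 + 1) * 2 +   # first hidden layer (2 neurons + bias) feeding 2 neurons of the second hidden layer
(2 + 1) * 1     # second hidden layer (2 neurons + bias) feeding the single output neuron
# [1] 17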
In the previous diagram, we have four layers of the neural network. The first layer consists of B1 (the bias), I1 (X1), I2 (X2), and I3 (X3). The second layer consists of three neurons: B2 (the bias of the first hidden layer), H1, and H2. Note that the bias B2 does not receive any input from the previous layer. Each hidden neuron receives an overall input from each of the neurons of the previous layer, which are B1, X1, X2, and X3 here. However, H1 and H2 of the first hidden layer will receive different aggregated inputs from B1, X1, X2, and X3. Appropriate weights are in action on each of the arcs of the network, and it is the weights that form the parameters of the neural network; that is, the arrival at H1 (of the first hidden layer) would be like

$$w_{01}B_1 + w_{11}X_1 + w_{21}X_2 + w_{31}X_3$$

and the effective arrival is through a transfer function. A transfer function might be an identity function, a sigmoidal function, and so on. Similarly, the arrival at the second neuron of the first hidden layer is

$$w_{02}B_1 + w_{12}X_1 + w_{22}X_2 + w_{32}X_3$$

By extension, B2, H1, and H2 (of the first hidden layer) will be the input for the second hidden layer, and B3, H1, and H2 (of the second hidden layer) will be the input for the final output. At each stage of the neural network, we have weights. The weights need to be determined in such a manner that the difference between the predicted output O1 and the true Y1 is as small as possible. Note that logistic regression is a particular case of the neural network, as can be seen by removing all the hidden layers so that the input layer leads directly into the output layer. The neural network will now be fitted for the hypothyroid problem.
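A minimal sketch (with made-up inputs and weights) of how one hidden neuron processes its arrival may make this concrete: a weighted sum of the bias and inputs is passed through a sigmoidal transfer function.

# Made-up example: the aggregated arrival at H1 and its effective arrival.
sigmoid <- function(z) 1 / (1 + exp(-z))
x <- c(0.5, -1.2, 2.0)                    # X1, X2, X3
w <- c(0.1, 0.8, -0.4, 0.3)               # weights for B1 (bias), X1, X2, X3
arrival_H1 <- w[1] * 1 + sum(w[-1] * x)   # aggregated input arriving at H1
H1 <- sigmoid(arrival_H1)                 # effective arrival after the transfer function
H1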
Neural network for hypothyroid classification
We use the nnet function from the package of the same name to set up the neural network for the hypothyroid classification problem. The formula, training, and test datasets continue as before. The accuracy calculation follows along similar lines to the segment in logistic regression. The fitted neural network is visualized using the plotnet graphical function from the NeuralNetTools package:
> set.seed(12345)
> NN_fit <- nnet(HT2_Formula,data = HT2_Train,size=p,trace=FALSE)
> NN_Predict <- predict(NN_fit,newdata=HT2_TestX,type="class")
> NN_Accuracy <- sum(NN_Predict==HT2_TestY)/nte
> NN_Accuracy
[1] 0.9827044025
> plotnet(NN_fit)
> title("Neural Network for Hypothyroid Classification")
Here, the accuracy is 98.27%, which is an improvement on the logistic regression model. The visual display of the fitted model is given in the following diagram. We have fixed the seed for the random initialization of the neural network parameters at 12345, using set.seed(12345), so that the results are reproducible at the reader's end. This is an interesting case for ensemble modeling. Different initial seeds, which the reader can toy around with as in the short sketch that follows, will lead to different accuracies. Sometimes, you will get an accuracy lower than any of the models considered in this section, and at other times you will get the highest accuracy. The arbitrariness of the seed leads to the important question of which solution is useful. Since the seeds are arbitrary, the question of a good seed or a bad seed does not arise, and a model giving you a higher accuracy in one such run does not necessarily mean anything:
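As a quick illustration of this seed sensitivity (a sketch only, with arbitrary seed values, assuming the training/test objects from the earlier listings are still in the workspace and nnet is loaded):

# Refit the same network under a few arbitrary seeds and compare test accuracies.
seeds <- c(1, 12345, 54321, 2019)
NN_Accuracies <- sapply(seeds, function(s) {
  set.seed(s)
  fit <- nnet(HT2_Formula, data = HT2_Train, size = p, trace = FALSE)
  pred <- predict(fit, newdata = HT2_TestX, type = "class")
  sum(pred == HT2_TestY) / nte
})
NN_Accuracies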
Naïve Bayes classifier
The naïve Bayes classifier is a simple classifier based on Bayes' formula. It uses empirical and conditional probabilities, as evidenced in the actual data. Beyond the naïve assumption that the covariates are independent of each other given the class label, we don't have any restrictions in using this model.
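For reference, the underlying computation is the textbook naïve Bayes rule (a standard statement, not specific to any one implementation): the class prior is combined with the per-covariate conditional probabilities estimated from the data,

$$P(Y = c \mid x_1, \ldots, x_k) \;\propto\; P(Y = c) \prod_{j=1}^{k} P(x_j \mid Y = c),$$

and the predicted class is the value of c that maximizes the right-hand side.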
Naïve Bayes for hypothyroid classification
A naïve Bayes classifier is fit using the naiveBayes function from the e1071 R package. The prediction and accuracy assessment are carried out using two functions, predict and sum:
> NB_fit <- naiveBayes(HT2_Formula,data=HT2_Train)
> NB_predict <- predict(NB_fit,newdata=HT2_TestX)
Warning message:
In data.matrix(newdata) : NAs introduced by coercion
> NB_Accuracy <- sum(NB_predict==HT2_TestY)/nte
> NB_Accuracy
[1] 0.9732704403
The accuracy of the naïve Bayes classifier is 97.33%, which is the same as that of the logistic regression model and lower than that of the neural network. We remark here that it is only a coincidence that the accuracies of this method and of logistic regression are the same.
Decision tree
Decision trees were developed mainly by Breiman and Quinlan, and they have evolved a lot since the 1980s. If the dependent variable is continuous, the decision tree will be a regression tree, and if it is a categorical variable, it will be a classification tree. Of course, we can have a survival tree as well. Decision trees will be the main model to benefit from the ensemble techniques, as will be seen throughout the book.
Consider the regression tree given in the following diagram. We can see that there are three input variables, $X_1$, $X_2$, and $X_3$, and the output variable is Y. Strictly speaking, a decision tree will not display all the variables used to build the tree. In this tree structure, the decision tree is conventionally displayed upside down, and we have four terminal nodes. Denote the split condition at the root node by $C_1$, the condition at the next split on the left by $C_2$, and the condition on the categorical variable at the final split by $C_3$ (the actual cut-off values are read off the diagram). If $C_1$ is satisfied, we move to the right side of the tree and conclude that the average Y value is 40. If $C_1$ is not satisfied, we move to the left and check whether $C_2$ holds. If this condition is not satisfied, we move to the left side of the tree and conclude that the average Y value is 100. If it is satisfied, we move to the right side, and then if the categorical condition $C_3$ holds, the average Y value would be 250, or 10 otherwise. This decision tree can be captured in the form of an equation too, as follows:

$$\bar{Y} = 40\,I(C_1) + 100\,I(\bar{C}_1)I(\bar{C}_2) + 250\,I(\bar{C}_1)I(C_2)I(C_3) + 10\,I(\bar{C}_1)I(C_2)I(\bar{C}_3)$$

Here, $I(\cdot)$ is the indicator function and $\bar{C}$ denotes the negation of condition C.
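A tiny sketch of the same tree as a prediction rule in R may help; the logical conditions C1, C2, and C3 are hypothetical stand-ins for the actual split rules shown in the diagram:

# Hypothetical stand-ins: C1, C2, C3 are the logical split conditions of the example tree.
tree_predict <- function(C1, C2, C3) {
  if (C1) {
    40          # right branch at the root
  } else if (!C2) {
    100         # left branch, second condition fails
  } else if (C3) {
    250         # left branch, second condition holds, categorical condition holds
  } else {
    10
  }
}
tree_predict(C1 = FALSE, C2 = TRUE, C3 = TRUE)   # returns 250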
The statistician Terry Therneau developed the rpart R package.
Decision tree for hypothyroid classification
Using the rpart function from the rpart package, we build a classification tree with the same formula and the same partitioned data as earlier. The constructed tree can be visualized using the plot function, and the variable names are embossed on the tree with the text function. The fitted classification tree (see the figure Classification Tree for Hypothyroid) can again be written as an equation of indicator functions, with the split conditions read off the plotted tree. Prediction and accuracy assessment are carried out in a similar way to that mentioned earlier:
> CT_fit <- rpart(HT2_Formula,data=HT2_Train)
> plot(CT_fit,uniform=TRUE)
> text(CT_fit)
> CT_predict <- predict(CT_fit,newdata=HT2_TestX,type="class")
> CT_Accuracy <- sum(CT_predict==HT2_TestY)/nte
> CT_Accuracy
[1] 0.9874213836
Consequently, the classification tree gives an accuracy of 98.74%, which is the best of the four models considered thus far. Next, we will consider the final model, support vector machines.
Support vector machines
Support vector machines, popularly abbreviated as SVM, are an important class of machine learning techniques. Theoretically, an SVM can handle an arbitrarily large, even infinite, number of features/covariates and build the appropriate classification or regression model.
SVM for hypothyroid classification
The svm function from the e1071 package will be useful for building an SVM classifier on the Hypothyroid dataset. Following the usual practice, we have the following output in the R session:
> SVM_fit <- svm(HT2_Formula,data=HT2_Train)
> SVM_predict <- predict(SVM_fit,newdata=HT2_TestX,type="class")
> SVM_Accuracy <- sum(SVM_predict==HT2_TestY)/nte
> SVM_Accuracy
[1] 0.9842767296
The SVM technique gives us an accuracy of 98.43%, which is the second best of the models set up thus far.
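The svm call above relied on the package default kernel, which in e1071 is the radial basis kernel. As a brief, optional sketch (reusing the objects already in the workspace), other kernels can be tried and their test accuracies compared:

# Refit with a linear kernel; other options in e1071::svm include "polynomial" and "sigmoid".
SVM_fit_lin <- svm(HT2_Formula, data = HT2_Train, kernel = "linear")
SVM_predict_lin <- predict(SVM_fit_lin, newdata = HT2_TestX)
sum(SVM_predict_lin == HT2_TestY)/nte   # accuracy may differ from the radial-kernel fit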
In the next section, we will run each of the five classification models for the Waveform, German Credit, Iris, and Pima Indians Diabetes problem datasets.