Predicting infant survival
Finally, we can move on to predicting the infants' survival chances. In this section, we will build two models: a linear classifier (logistic regression) and a non-linear one (a random forest). For the former, we will use all the features at our disposal, whereas for the latter, we will use the ChiSqSelector(...) method to select the top four features.
Logistic regression in MLlib
Logistic regression is a common benchmark against which other classification models are compared. MLlib used to provide a logistic regression model estimated using a stochastic gradient descent (SGD) algorithm. This model was deprecated in Spark 2.0 in favor of the LogisticRegressionWithLBFGS model.
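The LogisticRegressionWithLBFGS model uses the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimization algorithm. It is a quasi-Newton method that approximates the BFGS algorithm: rather than storing a full approximation of the inverse Hessian, it keeps only a handful of recent parameter and gradient differences.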
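To make the quasi-Newton idea more concrete, here is a minimal pure-Python sketch of L-BFGS applied to an L2-regularized logistic loss. It is an illustration only and not how MLlib implements LogisticRegressionWithLBFGS; the function names (fit_logistic_lbfgs, lbfgs_direction, loss_and_grad), the toy data, the regularization strength, and the simple backtracking line search are all assumptions made for this example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss_and_grad(w, X, y, reg):
    """Mean logistic loss plus an L2 penalty, and its gradient."""
    n = len(X)
    loss = 0.5 * reg * dot(w, w)
    grad = [reg * wj for wj in w]
    for xi, yi in zip(X, y):
        z = dot(w, xi)
        # log(1 + exp(z)) - y*z, written to avoid overflow
        loss += (max(z, 0.0) + math.log1p(math.exp(-abs(z))) - yi * z) / n
        err = sigmoid(z) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij / n
    return loss, grad

def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: approximates -H^{-1} * grad from recent (s, y) pairs."""
    q = list(grad)
    alphas = []
    # First loop: newest pair to oldest
    for s, yv in zip(reversed(s_hist), reversed(y_hist)):
        rho = 1.0 / dot(yv, s)
        a = rho * dot(s, q)
        q = [qi - a * yi for qi, yi in zip(q, yv)]
        alphas.append((a, rho))
    # Scale by an estimate of the inverse curvature along the latest step
    if s_hist:
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
        q = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest
    for (s, yv), (a, rho) in zip(zip(s_hist, y_hist), reversed(alphas)):
        b = rho * dot(yv, q)
        q = [qi + (a - b) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]

def fit_logistic_lbfgs(X, y, reg=0.01, memory=5, max_iter=100, tol=1e-6):
    w = [0.0] * len(X[0])
    s_hist, y_hist = [], []
    loss, grad = loss_and_grad(w, X, y, reg)
    for _ in range(max_iter):
        if math.sqrt(dot(grad, grad)) < tol:
            break
        d = lbfgs_direction(grad, s_hist, y_hist)
        slope = dot(grad, d)
        step = 1.0
        # Backtracking line search with a sufficient-decrease (Armijo) test
        while True:
            w_new = [wi + step * di for wi, di in zip(w, d)]
            loss_new, grad_new = loss_and_grad(w_new, X, y, reg)
            if loss_new <= loss + 1e-4 * step * slope or step < 1e-12:
                break
            step *= 0.5
        s = [a - b for a, b in zip(w_new, w)]
        yv = [a - b for a, b in zip(grad_new, grad)]
        # Keep only pairs with positive curvature, and at most `memory` of them
        if dot(s, yv) > 1e-12:
            s_hist.append(s)
            y_hist.append(yv)
            if len(s_hist) > memory:
                s_hist.pop(0)
                y_hist.pop(0)
        w, loss, grad = w_new, loss_new, grad_new
    return w

# Hypothetical toy data: first column is a constant 1.0 acting as the intercept
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0],
     [1.0, 2.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

w = fit_logistic_lbfgs(X, y)
print("weights:", w)
print("P(y=1 | x=4.0):", sigmoid(dot(w, X[-1])))
```

The "limited-memory" part is the memory argument: the search direction is rebuilt at every iteration from at most that many (s, y) pairs, so the storage cost grows linearly with the number of features rather than quadratically, as it would for a dense BFGS matrix.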
Note
For those of you who are mathematically adept and interested in this, we suggest perusing this blog post...