Ridge regression with SGD optimization in Spark 2.0
In this recipe, we use admission data from the UCI Machine Learning Repository to build and train a model that predicts student admission using the RDD-based LogisticRegressionWithSGD() Apache Spark API set. We use a given set of admission features (GRE, GPA, and Rank) to estimate model weights using ridge regression. Input feature standardization is demonstrated in a recipe, but it should be noted that standardization has an important effect on the results, especially in a ridge regression setting, because the L2 penalty depends on the scale of each feature.
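To make the effect of standardization and of the L2 penalty more concrete, the following is a minimal sketch in plain Python, independent of Spark and not the recipe's own code. The rows are made up for illustration (they are not taken from the UCI data), the names `standardize`, `ridge_sgd`, and `l2_norm` are hypothetical helpers, and treating the 0/1 admission label as a numeric least-squares target is a simplifying assumption.

```python
import math
import random

# Hypothetical rows (GRE, GPA, rank, admitted); NOT taken from the UCI data.
ROWS = [
    (400, 3.2, 3, 0), (700, 3.8, 1, 1), (620, 3.5, 2, 1), (480, 2.9, 4, 0),
    (760, 3.9, 1, 1), (540, 3.1, 3, 0), (660, 3.6, 2, 1), (500, 3.0, 4, 0),
]


def standardize(columns):
    """Scale each feature column to zero mean and unit variance."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        scaled.append([(v - mean) / std for v in col])
    return scaled


def ridge_sgd(X, y, lam, lr=0.01, epochs=200, seed=0):
    """Minimise 0.5*(w.x - y)^2 + 0.5*lam*||w||^2 with per-sample SGD."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = sum(wj * xj for wj, xj in zip(w, X[i])) - y[i]
            # Gradient = data term (err * x) + L2 penalty term (lam * w).
            w = [wj - lr * (err * xj + lam * wj) for wj, xj in zip(w, X[i])]
    return w


def l2_norm(w):
    return math.sqrt(sum(v * v for v in w))


# Standardize column-wise, then turn the columns back into rows.
feature_cols = standardize([[r[j] for r in ROWS] for j in range(3)])
X = [list(row) for row in zip(*feature_cols)]

# Centre the target so that no intercept term is needed in this sketch.
mean_y = sum(r[3] for r in ROWS) / len(ROWS)
y = [r[3] - mean_y for r in ROWS]

w_plain = ridge_sgd(X, y, lam=0.0)   # no penalty
w_ridge = ridge_sgd(X, y, lam=10.0)  # strong L2 penalty

print("no penalty:", [round(v, 3) for v in w_plain], "norm", round(l2_norm(w_plain), 3))
print("lam = 10  :", [round(v, 3) for v in w_ridge], "norm", round(l2_norm(w_ridge), 3))
```

Without standardization, GRE (hundreds) and GPA (single digits) would be penalized very unevenly, because the penalty acts on the raw weight values. Comparing the two printed norms shows the shrinkage that the penalty term produces.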
Spark's ridge regression API (LogisticRegressionWithSGD) is meant to deal with multicollinearity (the explanatory variables, or features, are correlated, so the assumption of independent and randomly distributed feature variables is somewhat flawed). Ridge is about shrinking (penalizing via L2 regularization, a quadratic penalty function) some of the parameters, thereby reducing their effect and in turn reducing model complexity. It...