Logistic regression with SGD optimization in Spark 2.0
In this recipe, we use admission data from the UCLA Institute for Digital Research and Education (IDRE) to build and train a model that predicts student admissions from a set of features (GRE, GPA, and Rank) used during the admission process. The model is trained with the RDD-based LogisticRegressionWithSGD() API in Apache Spark.
This recipe demonstrates both optimization (SGD) and regularization (penalizing the model for complexity or over-fitting). The two are different things that beginners often confuse: optimization is the procedure that searches for the model parameters, while regularization changes which parameters that search prefers. In the upcoming chapter, we cover both concepts in more detail, since understanding them is fundamental to a successful study of ML.
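To make the distinction concrete, here is a minimal sketch in plain Python, not Spark's implementation. The function `train_sgd`, its parameters `step_size` and `reg_param`, and the one-feature toy data are all hypothetical names and values chosen for illustration. The gradient step is the optimization part; the `reg_param * w` term is the L2 regularization part.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sgd(rows, labels, step_size=0.1, reg_param=0.0, epochs=200, seed=42):
    """Fit logistic-regression weights with plain SGD and an optional L2 penalty.

    Optimization: the per-example gradient step scaled by step_size.
    Regularization: the reg_param * w term that shrinks weights toward zero.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit examples in random order each epoch
        for i in order:
            x, y = rows[i], labels[i]
            z = bias + sum(w * v for w, v in zip(weights, x))
            error = sigmoid(z) - y  # gradient of the log loss with respect to z
            weights = [
                w - step_size * (error * v + reg_param * w)
                for w, v in zip(weights, x)
            ]
            bias -= step_size * error  # the bias is left unpenalized here
    return weights, bias


# Hypothetical, linearly separable toy data with a single feature.
rows = [[-2.0], [-1.0], [1.0], [2.0]]
labels = [0, 0, 1, 1]

plain_w, plain_b = train_sgd(rows, labels, reg_param=0.0)
ridge_w, ridge_b = train_sgd(rows, labels, reg_param=0.5)
print("no regularization:", plain_w)
print("L2 regularization:", ridge_w)
```

Both runs use the same optimizer, yet the regularized run should settle on a smaller weight. On separable data like this, the unregularized weight keeps growing the longer SGD runs, which is the kind of over-fitting the penalty is meant to curb.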
How to do it...
- We use the dataset from the UCLA Institute for Digital Research and Education (IDRE). You can find the dataset at the following URLs:
- For the home page, refer to http://www.ats.ucla.edu/stat/
- For the data file, refer to https://stats.idre.ucla.edu/stat/data/binary.csv
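Once the file is downloaded, each row has to be turned into a label and a feature vector before training. The following is a rough sketch in plain Python rather than Spark code. It assumes the file has a header row with columns named `admit`, `gre`, `gpa`, and `rank`; that layout is an assumption, so check it against the downloaded file. The helper `parse_admissions` and the sample rows are hypothetical.

```python
import csv
import io


def parse_admissions(text):
    """Turn CSV text into (label, [gre, gpa, rank]) pairs.

    Assumes a header row with the columns admit, gre, gpa, rank.
    """
    reader = csv.DictReader(io.StringIO(text))
    examples = []
    for row in reader:
        label = float(row["admit"])
        features = [float(row["gre"]), float(row["gpa"]), float(row["rank"])]
        examples.append((label, features))
    return examples


# Made-up rows in the assumed layout, for illustration only.
sample = "admit,gre,gpa,rank\n0,500,3.2,3\n1,700,3.9,1\n"
examples = parse_admissions(sample)
for label, features in examples:
    print(label, features)
```

The same label-plus-features shape is what a labeled-point representation carries, so a parser like this maps directly onto the training input described in the recipe.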
The...