Streaming logistic regression for an on-line classifier
In this recipe, we will be using the Pima Diabetes dataset we downloaded in the previous recipe and Spark's streaming logistic regression algorithm with SGD to predict whether a Pima with various features will test positive as a diabetic. It is an on-line classifier that learns and predicts based on the streamed data.
How to do it...
- Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
- Import the necessary packages:
import org.apache.log4j.{Level, Logger} import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.rdd.RDD import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.streaming.{Seconds, StreamingContext} import scala.collection...