LabeledPoint data structure for Spark ML
LabeledPoint is a data structure that has been around since Spark's early days. It packages a feature vector together with a label so the pair can be used in supervised learning algorithms. We demonstrate a short recipe that uses LabeledPoint, the Seq data structure, and DataFrame to run a logistic regression for binary classification of the data.
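Before walking through the recipe, a minimal sketch of the data structure itself may help. It assumes Spark's ML library (spark-mllib) is on the classpath. The object name `LabeledPointSketch` and the toy values are hypothetical and chosen only for illustration. LabeledPoint is an ordinary case class, so no cluster connection is needed to build one.

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors

// Hypothetical illustration object; not part of the recipe's program.
object LabeledPointSketch {

  // Each LabeledPoint pairs a Double label with a feature vector.
  // The values below are made-up toy data for a binary problem.
  val points: Seq[LabeledPoint] = Seq(
    LabeledPoint(0.0, Vectors.dense(0.2, 1.5)),
    LabeledPoint(1.0, Vectors.dense(2.9, 0.4)),
    // Sparse form: a vector of size 2 with a single non-zero entry at index 0.
    LabeledPoint(1.0, Vectors.sparse(2, Array(0), Array(3.1)))
  )

  def main(args: Array[String]): Unit = {
    // The label and features are exposed as plain fields.
    points.foreach { p =>
      println(s"label=${p.label} features=${p.features}")
    }
  }
}
```

Holding the points in a Seq keeps them as local Scala objects until they are handed to Spark, which is the shape the recipe relies on when it later builds a DataFrame.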
How to do it...
- Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
- Import the necessary packages for LabeledPoint, vectors, logistic regression, and Spark SQL (which provides SparkSession for access to the cluster):
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql._
- Create a SparkSession, which carries Spark's configuration, so we can have access to the cluster:
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName...