Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
In this recipe, we will explore Spark's implementation of expectation maximization (EM) via GaussianMixture(), which computes the maximum-likelihood parameters for a given set of input features. It assumes a Gaussian mixture model in which each point can be sampled from one of K sub-distributions (cluster memberships).
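Concretely, a Gaussian mixture models the density of a feature vector x as a weighted sum of K Gaussian components; EM then alternates between computing each point's responsibility for each component (E-step) and re-estimating the weights, means, and covariances from those responsibilities (M-step). In standard notation (not spelled out in the original recipe):

p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} w_k = 1, \; w_k \ge 0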
How to do it...
Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
Set up the package location where the program will reside:
package spark.ml.cookbook.chapter8
Import the necessary packages for logging, the Gaussian mixture clusterer, vector manipulation, and the Spark session:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession
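The log4j imports above are typically used to cut down Spark's console output before running the example; a minimal sketch (the ERROR level is an assumption, not taken from the original recipe):

Logger.getLogger("org").setLevel(Level.ERROR) // assumed level; suppresses INFO/WARN chatter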
Create Spark's session object:
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("myGaussianMixture")
  .config("spark.sql.warehouse.dir", ...
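The session setup above is truncated in the original. As a reference point, a complete, self-contained sketch that builds the session, fits a two-component Gaussian mixture with EM, and prints the learned parameters might look as follows; the warehouse directory value and the toy data points are assumptions for illustration only:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MyGaussianMixtureSketch {
  def main(args: Array[String]): Unit = {
    // Quiet Spark's logging (assumed level)
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("myGaussianMixture")
      .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse") // assumed value; elided in the original
      .getOrCreate()

    // Toy 2-D points clustered around two rough centers (illustrative only)
    val data = spark.sparkContext.parallelize(Seq(
      Vectors.dense(1.0, 1.1), Vectors.dense(0.9, 1.0), Vectors.dense(1.2, 0.8),
      Vectors.dense(8.0, 8.2), Vectors.dense(7.9, 8.1), Vectors.dense(8.3, 7.8)
    ))

    // Fit a Gaussian mixture with K = 2 components via EM
    val model = new GaussianMixture().setK(2).run(data)

    // Print the weight, mean vector, and covariance matrix of each learned component
    for (k <- 0 until model.k) {
      println(s"component $k: weight=${model.weights(k)}, mu=${model.gaussians(k).mu}")
      println(model.gaussians(k).sigma)
    }

    spark.stop()
  }
}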