Spark's basic statistical API to help you build your own algorithms
In this recipe, we cover Spark's statistical API, including the multivariate statistical summary (that is, Statistics.colStats), correlation, stratified sampling, hypothesis testing, random data generation, kernel density estimation, and much more, all of which can be applied to extremely large datasets while taking advantage of both parallelism and resiliency via RDDs.
How to do it...
- Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
- Import the necessary packages for the Spark session to gain access to the cluster and log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.SparkSession
import org.apache.log4j.Logger
import org.apache.log4j.Level
- Set the output level to ERROR to reduce the amount of output produced by Spark:
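A minimal sketch of this step, using the Logger and Level imports from the previous step (the logger name "org" covers Spark's internal loggers):

Logger.getLogger("org").setLevel(Level.ERROR)

From here, the recipe proceeds to create a SparkSession and exercise the statistical API. The following is a hedged sketch rather than the book's exact code: the master setting, application name, and sample vectors are illustrative assumptions; only Statistics.colStats and the summary fields it returns come from the Spark MLlib API itself. It reuses the imports from the step above:

val spark = SparkSession.builder
  .master("local[*]")          // assumption: local mode for the example
  .appName("StatisticsRecipe") // assumption: illustrative application name
  .getOrCreate()

// Hypothetical sample data: an RDD of dense vectors, one row per observation
val observations = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)
))

// colStats computes column-wise summary statistics in one distributed pass
val summary = Statistics.colStats(observations)
println(summary.mean)        // column-wise means
println(summary.variance)    // column-wise variances
println(summary.numNonzeros) // column-wise non-zero counts

spark.stop()

Because colStats returns a MultivariateStatisticalSummary computed across partitions, the same call scales from this toy data to very large RDDs without changes to the code.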