Spark's basic statistical API to help you build your own algorithms
In this recipe, we cover Spark's multivariate statistical summary (that is, Statistics.colStats). It belongs to Spark's basic statistics API, which also provides correlation, stratified sampling, hypothesis testing, random data generation, and kernel density estimation, among others. All of these can be applied to extremely large datasets while taking advantage of both the parallelism and the resiliency of RDDs.
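As a rough illustration of the multivariate summary this recipe is about, the following minimal sketch calls Statistics.colStats on a small RDD of vectors and reads back a few column-wise statistics. The object name ColStatsSketch, the helper summarize, the local master setting, and the sample values are all assumptions made for illustration; they are not taken from the recipe's own program.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.sql.SparkSession

object ColStatsSketch {

  // Hypothetical helper: builds a tiny RDD of made-up observations and
  // returns the column-wise summary computed by Statistics.colStats.
  def summarize(spark: SparkSession): MultivariateStatisticalSummary = {
    // Three observations (rows), each with three features (columns).
    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.dense(1.0, 10.0, 100.0),
      Vectors.dense(2.0, 20.0, 200.0),
      Vectors.dense(3.0, 30.0, 300.0)
    ))
    Statistics.colStats(rows)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the master and app name are assumptions.
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("ColStatsSketch")
      .getOrCreate()

    val summary = summarize(spark)

    // Each statistic is reported per column, as a vector with one entry per feature.
    println(s"count:       ${summary.count}")
    println(s"mean:        ${summary.mean}")
    println(s"variance:    ${summary.variance}")
    println(s"min:         ${summary.min}")
    println(s"max:         ${summary.max}")
    println(s"numNonzeros: ${summary.numNonzeros}")
    println(s"normL1:      ${summary.normL1}")
    println(s"normL2:      ${summary.normL2}")

    spark.stop()
  }
}
```

The summary is computed in a single distributed pass over the RDD, so the same call should apply unchanged to a much larger dataset loaded from storage instead of the in-memory sample used here.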
How to do it...
- Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
- Import the necessary packages for the Spark session to gain access to the cluster and...