Implementing k-means using H2O over Spark
In this recipe, we'll look at how to run the k-means clustering algorithm on a prostate cancer dataset. Please download the dataset from https://github.com/ChitturiPadma/datasets/blob/master/prostate.csv. The data comes from a study that examined the correlation between the level of prostate-specific antigen and a number of other clinical measures in men.
Getting ready
To step through this recipe, you will need a running Spark cluster in any one of the following modes: local, standalone, YARN, or Mesos. Include the Spark MLlib package in the build.sbt file so that the related libraries are downloaded and the API can be used. Install Scala and Java, and optionally Hadoop. Also, install Sparkling Water as discussed in the preceding recipe.
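As a rough illustration of the build.sbt setup described above, the following sketch declares Spark MLlib and Sparkling Water as dependencies. The project name and every version number here are assumptions, not values taken from this recipe; replace them with the versions that match your Spark installation and the Sparkling Water release installed in the preceding recipe.

```scala
// build.sbt: illustrative sketch only; every version below is an assumption.
name := "h2o-kmeans-recipe"   // hypothetical project name

scalaVersion := "2.11.12"     // assumed; must match the Scala version of your Spark build

// Assumed versions: keep the Sparkling Water release line aligned with Spark's.
val sparkVersion = "2.4.0"
val sparklingWaterVersion = "2.4.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion,
  "org.apache.spark" %% "spark-sql"   % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion,
  // Sparkling Water core brings in the H2O classes used alongside Spark.
  "ai.h2o" %% "sparkling-water-core" % sparklingWaterVersion
)
```

Sparkling Water releases are tied to specific Spark versions, so a mismatch between the two is a likely first thing to check if the H2O context fails to start.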
How to do it…
The sample rows in prostate.csv look like the following:
Here is the code to run k-means on the preceding dataset:
import org.apache.spark._ ...