Building a clustering model
Often, it is hard to get our hands on data that is labeled. Also, sometimes you might want to find underlying patterns in your dataset. In this recipe, we will learn how to build the popular k-means clustering model in Spark.
Getting ready
To execute this recipe, you need to have a working Spark environment. You should have already gone through the Standardizing the data recipe where we standardized the encoded census data.
No other prerequisites are required.
How to do it...
Just like with classification or regression models, building clustering models is pretty straightforward in Spark. Here's the code that fits a k-means model to the census data:
import pyspark.mllib.clustering as clu

model = clu.KMeans.train(
    final_data.map(lambda row: row[1]),
    2,
    initializationMode='random',
    seed=666
)
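To make the mechanics concrete, here is a minimal plain-Python sketch of Lloyd's algorithm, the iterative procedure that k-means training performs under the hood when initializationMode='random' is used: pick k random points as the starting centroids, then alternate between assigning points to their nearest centroid and moving each centroid to the mean of its cluster. The function name kmeans and the toy two-blob dataset are illustrative, not part of the Spark API.

```python
import random

def kmeans(points, k, max_iterations=100, seed=666):
    """Illustrative sketch of k-means with random initialization
    (Lloyd's algorithm); not the Spark implementation itself."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c))
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # keep the old centroid if its cluster ended up empty.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids

# Toy data: two well-separated blobs, so k=2 should place one
# centroid inside each blob.
data = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2),
        (10.0, 10.1), (10.1, 10.0), (9.9, 9.8)]
print(sorted(kmeans(data, 2)))
```

The Spark version does the same assignment/update loop, but distributes the assignment step across the cluster and aggregates the per-partition sums to update the centroids.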
How it works...
First, we need to import the clustering submodule of MLlib. Just like before, we build the model with MLlib's KMeans class
. The .train...