Clustering forest cover types
Clustering is a family of unsupervised methods that attempts to find patterns in data without any class labels. In other words, clustering methods find commonalities between records and group them into clusters based on how similar the records are to each other and how dissimilar they are from those in other clusters.
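For intuition, the nearest-centroid assignment at the heart of k-means can be sketched in a few lines of NumPy. The points and centroids below are made-up illustrative values, not part of the recipe's data:

```python
import numpy as np

# Four made-up 2-D points and two candidate cluster centroids
points = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
centroids = np.array([[1.0, 1.0], [8.0, 8.0]])

# Each point joins the cluster whose centroid is nearest (Euclidean distance)
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)  # -> [0 0 1 1]
```

K-means alternates this assignment step with recomputing each centroid as the mean of its assigned points until the assignments stop changing.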
In this recipe, we will build the most fundamental clustering model of them all: k-means.
Getting ready
To execute this recipe, you will need a working Spark environment and the data already loaded into the forest DataFrame.
No other prerequisites are required.
How to do it...
The process of building a clustering model in Spark does not deviate significantly from what we have already seen in either the classification or regression examples:
import pyspark.ml.feature as feat
import pyspark.ml.clustering as clust

vectorAssembler = feat.VectorAssembler(
    inputCols=forest.columns[:-1],
    outputCol='features')

kmeans_obj = clust.KMeans(
    k=7,                      # one cluster per forest cover type
    featuresCol='features')   # remaining arguments truncated in the original