Clustering data with the k-means algorithm
The k-means algorithm is probably the most widely known data mining technique for clustering vectorized data. It partitions the observations into a fixed number of discrete clusters based on their similarity: each observation is assigned to the cluster whose centroid is closest to it in terms of Euclidean distance.
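The assignment rule can be illustrated with a few lines of NumPy. This is only a minimal sketch of one k-means iteration, not the code Scikit runs internally; the function names and the toy data are made up for illustration.

```python
import numpy as np

def assign_to_nearest_centroid(observations, centroids):
    # Euclidean distance from every observation to every centroid;
    # the result has shape (n_observations, n_centroids)
    distances = np.linalg.norm(
        observations[:, np.newaxis, :] - centroids[np.newaxis, :, :],
        axis=2
    )
    # index of the closest centroid for each observation
    return distances.argmin(axis=1)

def update_centroids(observations, labels, n_clusters):
    # move each centroid to the mean of the observations assigned to it
    # (this sketch assumes no cluster ends up empty)
    return np.array([
        observations[labels == k].mean(axis=0)
        for k in range(n_clusters)
    ])

# toy data: two obvious groups and two rough starting centroids
observations = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

labels = assign_to_nearest_centroid(observations, centroids)
new_centroids = update_centroids(observations, labels, n_clusters=2)

print(labels)
print(new_centroids)
```

The full algorithm repeats these two steps until the assignments stop changing or an iteration limit is reached.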
Getting ready
To run this recipe, you need pandas and Scikit. No other prerequisites are required.
How to do it…
Scikit offers several clustering models in its cluster submodule. Here, we will use .KMeans(...) to estimate our clustering model (the clustering_kmeans.py file):
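import sklearn.cluster as cl

def findClusters_kmeans(data):
    '''
        Cluster data using k-means
    '''
    # create the clustering model object
    # (n_jobs is omitted: the parameter was removed from
    # KMeans in scikit-learn 1.0)
    kmeans = cl.KMeans(
        n_clusters=4,
        verbose=0,
        n_init=30
    )

    # fit the data
    return kmeans.fit(data)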
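Once fitted, the KMeans object exposes the results through attributes such as labels_, cluster_centers_, and inertia_. The snippet below is a hedged usage sketch: it uses synthetic data from make_blobs as a stand-in for the recipe's dataset, and the sample size and random_state values are arbitrary assumptions chosen only to make the example reproducible.

```python
import sklearn.cluster as cl
from sklearn.datasets import make_blobs

# synthetic stand-in data: 200 two-dimensional points around 4 centers
data, _ = make_blobs(
    n_samples=200, centers=4, n_features=2, random_state=42
)

# same settings as the recipe; random_state is added here only
# so that repeated runs give the same result
kmeans = cl.KMeans(n_clusters=4, n_init=30, random_state=42).fit(data)

# cluster index assigned to each of the first ten observations
print(kmeans.labels_[:10])

# coordinates of the four centroids, shape (4, 2)
print(kmeans.cluster_centers_)

# sum of squared distances of observations to their closest centroid
print(kmeans.inertia_)
```

With n_init=30, the algorithm is run 30 times from different starting centroids and the run with the lowest inertia is kept.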
How it works…
Just like in the previous chapter (and in all the recipes that follow), we start...