Clustering text data using K-Means
In this recipe, we look at how to cluster text data using Mahout's implementation of the K-Means algorithm. K-Means is a very popular clustering algorithm; you can read more about it at https://en.wikipedia.org/wiki/K-means_clustering.
Getting ready
To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Mahout installed on it.
How to do it...
To cluster text with Mahout's K-Means algorithm, we first need to get some text data and copy it to HDFS:
hadoop fs -mkdir /kmeans
hadoop fs -put mydata.txt /kmeans/input
To execute the K-Means job on the given data, we first need to convert the text into Hadoop sequence files, and then convert those sequence files into TF-IDF vectors. Mahout provides built-in utilities to perform both steps. The following are the commands to do this.
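As a brief aside before the commands, it may help to see what a TF-IDF vector contains. The following is a minimal Python sketch, not Mahout's code: it assumes the textbook weighting tf * log(N / df) and plain whitespace tokenization, and the function name tf_idf and the sample documents are made up for illustration. Mahout's own utilities may tokenize and weight terms differently.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Illustrative only: this is the textbook formula, not necessarily the
    exact weighting that Mahout applies.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

# Hypothetical sample documents, standing in for the contents of mydata.txt.
docs = ["hadoop stores data", "mahout clusters data", "mahout runs on hadoop"]
vectors = tf_idf(docs)
```

In this sketch, a term that occurs in only one document (such as "stores") receives a higher weight than a term shared by several documents (such as "data"), and a term present in every document would receive a weight of zero. K-Means then groups documents whose weight vectors lie close together.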
To convert text data into a sequence file, here is...