An example of a document clustering application
This application will read a set of documents and organize them using the k-means clustering algorithm. To achieve this, we will use four components:
- The Reader system: This system will read all the documents and convert every document into a list of String objects.
- The Indexer system: This system will process the documents and convert them into a list of words. At the same time, it will generate the global vocabulary of the set of documents with all the words that appear in them.
- The Mapper system: This system will convert each list of words into a mathematical representation using the vector space model. Each component of the vector will hold the Tf-Idf (short for term frequency–inverse document frequency) metric of the corresponding word.
- The Clustering system: This system will use the k-means clustering algorithm to cluster the documents.
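
To make the Tf-Idf metric mentioned for the Mapper system more concrete, here is a minimal sequential sketch of how such weights could be computed from the word lists. The class name TfIdfSketch and its two methods are hypothetical and are not part of the application described here. The sketch also assumes one common variant of the formula (term count divided by document length, multiplied by the natural logarithm of the number of documents divided by the number of documents containing the word); the text above does not say which variant the application uses.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only; not the application's Mapper.
final class TfIdfSketch {

    // Counts, for every word, the number of documents that contain it.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures each word is counted once per document.
            for (String word : new HashSet<>(doc)) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    // Computes Tf-Idf weights for one document.
    // Assumed variant: tf = count / document length, idf = ln(totalDocs / df).
    // Assumes df was built from a collection that includes this document.
    static Map<String, Double> tfIdf(List<String> doc,
                                     Map<String, Integer> df,
                                     int totalDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : doc) {
            counts.merge(word, 1, Integer::sum);
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            double tf = (double) entry.getValue() / doc.size();
            double idf = Math.log((double) totalDocs / df.get(entry.getKey()));
            weights.put(entry.getKey(), tf * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Two tiny documents, already split into words.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("java", "concurrency", "java"),
                Arrays.asList("java", "streams"));
        Map<String, Integer> df = documentFrequencies(docs);
        for (List<String> doc : docs) {
            System.out.println(tfIdf(doc, df, docs.size()));
        }
    }
}
```

With this variant, a word that appears in every document gets a weight of zero, so only the words that distinguish one document from another contribute to the distances that k-means computes.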
All these systems are concurrent and use their own tasks to implement their functionality. Let's see how you can implement...