Summary
In this chapter, we've learned about the process of clustering and applied the popular k-means algorithm to cluster large numbers of text documents.
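As a quick refresher on the algorithm itself, the following is a minimal, in-memory sketch of the k-means loop over sparse term-weight vectors (maps of term to weight). The function names and the Euclidean distance choice are illustrative; they are not the chapter's scalable Hadoop implementation.

```clojure
;; A minimal in-memory k-means sketch over sparse vectors represented as
;; maps of term -> weight. Illustrative only; the chapter runs clustering
;; at scale as a Hadoop job.
(defn distance
  "Euclidean distance between two sparse vectors."
  [a b]
  (->> (into (set (keys a)) (keys b))
       (map #(let [d (- (get a % 0.0) (get b % 0.0))] (* d d)))
       (reduce +)
       (Math/sqrt)))

(defn centroid
  "Element-wise mean of a group of sparse vectors."
  [vs]
  (let [n (count vs)]
    (reduce (partial merge-with +) {}
            (map (fn [v] (into {} (for [[k w] v] [k (/ w n)]))) vs))))

(defn k-means
  "Repeatedly assign each vector to its nearest centroid and recompute the
  centroids, stopping when they no longer change or after max-iter rounds."
  [k vectors max-iter]
  (loop [centroids (vec (take k (shuffle vectors))) iter 0]
    (let [clusters   (vals (group-by #(apply min-key (partial distance %) centroids)
                                     vectors))
          centroids' (mapv centroid clusters)]
      (if (or (= (set centroids) (set centroids')) (>= iter max-iter))
        clusters
        (recur centroids' (inc iter))))))
```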
This provided an opportunity to cover the specific challenges presented by text processing, where data is often messy, ambiguous, and high-dimensional. We saw how stop-word removal and stemming can reduce the number of dimensions and how TF-IDF can identify the most important ones. We also saw how n-grams and shingling can tease out the context of each word, at the cost of a vast proliferation of terms.
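To recall how those weightings fit together, here is a plain-Clojure sketch of TF-IDF and n-gram shingling over already tokenised documents. The function names and the smoothed IDF formula are illustrative rather than the chapter's exact code.

```clojure
;; Sketches of the dimensionality tools mentioned above, assuming documents
;; have already been tokenised (and stop-word-filtered/stemmed) into term
;; sequences.
(defn document-frequencies
  "Map of term -> number of documents in `docs` containing that term."
  [docs]
  (->> docs
       (map (comp frequencies distinct))
       (apply merge-with +)))

(defn tf-idf
  "Weight each term of one document by its frequency, discounted by how
  many of the `n-docs` documents contain it (log-smoothed IDF)."
  [doc-terms df n-docs]
  (into {}
        (for [[term tf] (frequencies doc-terms)]
          [term (* tf (Math/log (/ (double n-docs)
                                   (inc (get df term 0)))))])))

(defn n-grams
  "Contiguous n-term shingles, preserving a little of each word's context:
  (n-grams 2 [\"big\" \"data\" \"set\"]) => ((\"big\" \"data\") (\"data\" \"set\"))."
  [n tokens]
  (partition n 1 tokens))
```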
We've explored Parkour in greater detail and seen how it can be used to write sophisticated, scalable Hadoop jobs. In particular, we've seen how to make use of the distributed cache and custom tuple schemas to write Hadoop jobs that process data represented as Clojure data structures. We used both of these to implement a method for generating unique, cluster-wide term IDs.
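The idea behind collision-free, cluster-wide IDs can be sketched in a few lines of plain Clojure: if each reducer knows its own partition number and the total number of partitions, it can stride through the ID space without coordinating with any other partition. This sketch shows the striding scheme only, under that assumption; the chapter implements ID generation inside a Parkour job and ships the results to mappers via the distributed cache.

```clojure
;; Assign globally unique term IDs from a single reducer partition by
;; striding: partition p of n hands out p, p + n, p + 2n, ...
;; Illustrative names; not the chapter's Parkour job.
(defn unique-term-ids
  "Map each of `terms` seen by reducer `partition` (out of `n-partitions`)
  to an ID that cannot collide with any other partition's IDs."
  [partition n-partitions terms]
  (into {}
        (map-indexed (fn [offset term]
                       [term (+ partition (* offset n-partitions))])
                     terms)))

;; Partition 1 of 4 assigns 1, 5, 9, ...; partition 0 would assign 0, 4, 8, ...
(unique-term-ids 1 4 ["cluster" "corpus" "centroid"])
;; => {"cluster" 1, "corpus" 5, "centroid" 9}
```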
Finally, we witnessed...