Clustering text
Clustering is the process of finding groups of objects that are similar to each other. The goal is that objects within a cluster should be more similar to each other than to objects in other clusters. Like classification, it is not a specific algorithm so much as a general class of algorithms that solve a general problem.
Although there are a variety of clustering algorithms, all rely to some extent on a distance measure. For an algorithm to determine whether two objects belong in the same or different clusters it must be able to determine a quantitative measure of the distance (or, if you prefer, the similarity) between them. This calls for a numeric measure of distance: the smaller the distance, the greater the similarity between two objects.
Since clustering is a general technique that can be applied to diverse data types, there are a large number of possible distance measures. Nonetheless, most data can be represented by one of a handful of common abstractions: a set,...