Better clustering with TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a general approach to weighting terms within a document vector so that terms that are common across the whole dataset are weighted less heavily than terms that are rarer. This captures the intuition, consistent with what we observed earlier, that words such as "said" are not a strong basis for building clusters.
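To make the weighting idea concrete, here is one common textbook formulation of TF-IDF. It is offered as a sketch only: several variants exist, and the one used elsewhere in this text may differ in its scaling or smoothing. The symbols are assumptions introduced for illustration: tf(t, d) is the number of times term t occurs in document d, N is the number of documents, and df(t) is the number of documents containing t.

```latex
\[
  w_{t,d} \;=\; \mathrm{tf}(t,d) \cdot \log\frac{N}{\mathrm{df}(t)}
\]

% Hypothetical figures, natural logarithm, N = 1000 documents:
\[
  \mathrm{df} = 900:\quad \log\frac{1000}{900} \approx 0.105
  \qquad\qquad
  \mathrm{df} = 10:\quad \log\frac{1000}{10} \approx 4.605
\]
```

With these made-up counts, a word that appears in 900 of 1,000 documents has its frequency scaled by about 0.1, while a word that appears in only 10 documents has its frequency scaled by about 4.6. The rarer term therefore contributes far more to the document vector.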
Zipf's law
Zipf's law states that the frequency of any word is inversely proportional to its rank in the frequency table. Thus, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. Let's see if this applies across our Reuters corpus:
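Before looking at the corpus, it may help to write the law in symbols. This is the idealized form with exponent 1. The symbols f(r) for the frequency of the word at rank r and C for a corpus-dependent constant are assumptions introduced here for illustration.

```latex
\[
  f(r) \;\approx\; \frac{C}{r}
  \qquad\Longrightarrow\qquad
  \frac{f(1)}{f(2)} \approx 2, \quad \frac{f(1)}{f(3)} \approx 3
\]

% Taking logarithms of both sides:
\[
  \log f(r) \;\approx\; \log C - \log r
\]
```

If the law holds, plotting frequency against rank on log-log axes should give roughly a straight line with slope -1. The listing that follows gathers the term frequencies needed to check this: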
(defn ex-6-13 []
  (let [documents (fs/glob "data/reuters-text/*.txt")
        doc-count 1000
        top-terms 25
        term-frequencies (->> (map slurp documents)
                              (remove...