Calculating relative values
One way to normalize values is to scale frequencies by the sizes of their groups. For example, say the word truth appears three times in a document. This means one thing if the document has 30 words. It means something else if the document has 300 or 3,000 words. Moreover, if the dataset has documents of all these lengths, how do you compare word frequencies across documents?
The answer is to rescale the frequency counts. In the simplest case, we can divide each term's count by the length of its document. Or, if we want better results, we might use something more complicated such as term frequency-inverse document frequency (TF-IDF).
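To make the simpler option concrete, here is a minimal sketch of scaling counts by document length. It assumes the document has already been split into a sequence of tokens. The name relative-freqs is a hypothetical helper chosen for illustration, not necessarily the function this recipe builds.

```clojure
;; Hypothetical helper for illustration only.
;; Takes a sequence of tokens and returns a map from each term to its
;; count divided by the total number of tokens in the document.
(defn relative-freqs
  [tokens]
  (let [total (count tokens)]
    (into {}
          (for [[term n] (frequencies tokens)]
            ;; Dividing two integers yields an exact ratio in Clojure;
            ;; wrap the division in (double ...) for a floating-point value.
            [term (/ n total)]))))

;; The word "truth" appears three times in a 30-word document
;; and three times in a 300-word document.
(def short-doc (concat (repeat 3 "truth") (repeat 27 "other")))
(def long-doc (concat (repeat 3 "truth") (repeat 297 "other")))

(get (relative-freqs short-doc) "truth") ; => 1/10
(get (relative-freqs long-doc) "truth")  ; => 1/100
```

The raw count is the same in both documents, but the relative frequency differs by a factor of ten, which makes the values comparable across documents of different lengths.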
For this recipe, we'll rescale some term frequencies by the total word count for their document.
Getting ready
We don't need much for this recipe. We'll use the minimal project.clj file, which is listed here:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])
However, it will be easier...