Scaling document frequencies by document size
While raw token frequencies can be useful, they often have one major problem: comparing frequencies across documents is complicated if the documents are not the same size. If the word customer appears 23 times in a 500-word document and 40 times in a 1,000-word document, which one do you think is more focused on that word? It's difficult to say.
To work around this, it's common to scale the token frequencies for each document by the size of the document. That's what we'll do in this recipe.
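As a minimal sketch of the idea, assuming the frequencies are held in a map from token to count (the shape returned by clojure.core/frequencies), the scaling could look like the following. The name scale-by-size is hypothetical and used only for illustration; it is not necessarily the function this recipe builds.

```clojure
;; Hypothetical helper, for illustration only.
;; Divides each token's count by the total number of tokens in the
;; document, so the scaled values sum to 1.
(defn scale-by-size
  [freqs]
  (let [total (reduce + 0 (vals freqs))]
    (if (zero? total)
      freqs ; an empty document has nothing to scale
      (into {} (map (fn [[token n]] [token (/ n total)]) freqs)))))

;; The two documents from the example above, with every other token
;; lumped under a made-up "other" key to make up the document size.
(def doc-a (scale-by-size {"customer" 23, "other" 477}))
(def doc-b (scale-by-size {"customer" 40, "other" 960}))

(get doc-a "customer") ;=> 23/500
(get doc-b "customer") ;=> 1/25
```

Because Clojure's / keeps integer division exact, the results are ratios: 23/500 (0.046) for the shorter document and 1/25 (0.04) for the longer one. Once scaled, the 500-word document turns out to be slightly more focused on customer, even though its raw count is lower.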
Getting ready
We'll continue building on the previous recipes in this chapter. Because of that, we'll use the same project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
We'll use the token frequencies that we computed in the Getting document frequencies recipe. We...