Scaling document frequencies with TF-IDF
In the last few recipes, we've seen how to generate term frequencies and scale them by the size of the document so that the frequencies from two different documents can be compared.
Term frequencies have another problem, though: they don't tell you how important a term is relative to all of the documents in the corpus.
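To address this, we will use term frequency-inverse document frequency (TF-IDF). This metric scales a term's frequency within a document by the inverse of how many documents in the corpus contain that term, so a term that appears in almost every document counts for less than one that appears in only a few.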
In this recipe, we'll assemble the parts needed to implement TF-IDF.
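Before assembling the real pieces, it may help to see the idea in miniature. The following is only a rough sketch, not the implementation this recipe builds: the names idf, tf-idf, and corpus are hypothetical, each document is assumed to be a non-empty sequence of tokens, and the term is assumed to occur in at least one document. It uses one common variant of the formula, the natural log of the number of documents divided by the number of documents containing the term.

```clojure
;; Hypothetical toy corpus: each document is a sequence of tokens.
(def corpus
  [["the" "cat" "sat"]
   ["the" "dog" "ran"]
   ["the" "cat" "ran"]])

(defn idf
  "Inverse document frequency: log of (number of documents /
  number of documents containing term). Assumes the term occurs
  in at least one document."
  [corpus term]
  (let [n  (count corpus)
        df (count (filter #(some #{term} %) corpus))]
    (Math/log (/ (double n) df))))

(defn tf-idf
  "Term frequency in doc, scaled by the document's length, times
  the term's inverse document frequency in corpus. Assumes doc is
  non-empty."
  [corpus doc term]
  (let [tf (/ (double (get (frequencies doc) term 0))
              (count doc))]
    (* tf (idf corpus term))))
```

With this toy data, (tf-idf corpus (second corpus) "the") is 0.0 because "the" occurs in every document, while "dog", which occurs in only one document, scores higher than "ran", which occurs in two.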
Getting ready
We'll continue building on the previous recipes in this chapter. Because of that, we'll use the same project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
We'll also use two functions that we've created...