Mapping documents to a sparse vector space representation
Many text algorithms operate on vector space representations of documents. This means that each document is converted into a vector. Each distinct token type is assigned one position, and that position is the same in every document's vector. For instance, the word "text" might have position 42, so index 42 in every document vector will hold the frequency (or some other value) of the word "text".
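However, most documents contain only a small fraction of the words in the collection, so most positions in any one vector are zero. This makes them sparse vectors, and we can store them in more efficient formats that keep only the nonzero entries.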
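To make the idea concrete, here is a minimal sketch in plain Clojure, with no Colt involved yet. The names tokenize, build-index, and ->sparse are hypothetical helpers invented for this illustration, and the regular-expression tokenizer is only a stand-in for a real one. Each sparse vector is modeled as a map from position to frequency.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in tokenizer: lowercase, then split on non-letters.
(defn tokenize [text]
  (remove str/blank? (str/split (str/lower-case text) #"[^a-z]+")))

;; Assign each distinct token a fixed position, in order of first appearance.
;; The same index is shared by every document in the collection.
(defn build-index [token-seqs]
  (reduce (fn [index token]
            (if (contains? index token)
              index
              (assoc index token (count index))))
          {}
          (apply concat token-seqs)))

;; Sparse vector as a map of position -> frequency. Tokens that do not
;; occur in the document simply have no entry.
(defn ->sparse [index tokens]
  (into {}
        (map (fn [[token n]] [(index token) n])
             (frequencies tokens))))

(def docs ["The text is short." "Another text, another day."])
(def token-seqs (map tokenize docs))
(def index (build-index token-seqs))
(def vectors (map #(->sparse index %) token-seqs))
```

In this sketch the index holds six positions, but the second document stores only three entries. Both documents use the same position for "text", which is what lets their vectors be compared.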
The Colt library (http://acs.lbl.gov/ACSSoftware/colt/) contains implementations of sparse vectors. In this recipe, we'll see how to read a collection of documents into these sparse vectors.
Getting ready…
For this recipe, we'll need the following in our project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT" :dependencies [[org.clojure/clojure "1.6.0"] [clojure-opennlp...