Creating term frequency vectors
To calculate the Euclidean distance, let's first create a vector from our dictionary and document. This will allow us to easily compare the term frequencies between documents, because each term will occupy the same index in every document's vector.
(defn term-id [dict term]
  (get-in @dict [:terms term]))

(defn term-frequencies [dict terms]
  (->> (map #(term-id dict %) terms)
       (remove nil?)
       (frequencies)))

(defn map->vector [dictionary id-counts]
  (let [zeros (vec (repeat (:count @dictionary) 0))]
    (-> (reduce #(apply assoc! %1 %2) (transient zeros) id-counts)
        (persistent!))))

(defn tf-vector [dict document]
  (map->vector dict (term-frequencies dict document)))
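To see what these functions produce, here is a small sketch that can be pasted into a REPL. The names example-dict and example-doc are hypothetical, and the dictionary shape (an atom holding a :terms map of term to ID and a :count of terms) is an assumption inferred from the get-in and :count calls in the listing. The definitions are restated so the snippet runs on its own.

```clojure
;; Hypothetical dictionary: an atom holding a term->ID map and the term count.
;; This shape is an assumption inferred from the listing, not a confirmed API.
(def example-dict
  (atom {:count 4
         :terms {"the" 0, "cat" 1, "sat" 2, "mat" 3}}))

;; Definitions restated from the listing so the snippet is self-contained.
(defn term-id [dict term]
  (get-in @dict [:terms term]))

(defn term-frequencies [dict terms]
  (->> (map #(term-id dict %) terms)
       (remove nil?)              ; drop terms missing from the dictionary
       (frequencies)))

(defn map->vector [dictionary id-counts]
  (let [zeros (vec (repeat (:count @dictionary) 0))]
    (-> (reduce #(apply assoc! %1 %2) (transient zeros) id-counts)
        (persistent!))))

(defn tf-vector [dict document]
  (map->vector dict (term-frequencies dict document)))

;; A tokenized document; "dog" is not in the dictionary, so it is ignored.
(def example-doc ["the" "cat" "sat" "the" "dog"])

(term-frequencies example-dict example-doc)
;; a map of term ID to count: {0 2, 1 1, 2 1}

(tf-vector example-dict example-doc)
;; [2 1 1 0]
```

The term "mat" never appears in the document, so its slot (index 3) keeps the initial zero.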
The term-frequencies function creates a map of term ID to frequency count for each term in the document. The map->vector function simply takes this map and associates the frequency count at the index of the vector given by the term ID. Since there may be many terms...