Tokenizing text
Before we can do any real analysis of a text or a corpus of texts, we have to identify the words in the text. This process is called tokenization. Its output is a list of words, possibly including punctuation tokens. This is different from tokenizing formal languages such as programming languages: it is meant to work with natural languages, and its results are less structured.
It's easy to write your own tokenizer, but there are a lot of edge and corner cases to account for. It's just as easy to include a natural language processing (NLP) library that provides one or more tokenizers. In this recipe, we'll use OpenNLP (http://opennlp.apache.org/) and its Clojure wrapper, clojure-opennlp (https://clojars.org/clojure-opennlp).
Getting ready
We'll need to include the clojure-opennlp dependency in our project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 ;; The version shown here is illustrative; use the
                 ;; current clojure-opennlp release from Clojars.
                 [clojure-opennlp "0.3.3"]])
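The tokenizer itself is built from a pre-trained model file. As a preview of how the wrapper is used, here is a minimal sketch, assuming you've downloaded the English token model (en-token.bin) from the OpenNLP models page and saved it in a models/ directory; the path and the exact output shown are assumptions and may vary with the model version you use.

(require '[opennlp.nlp :as nlp])

;; make-tokenizer takes a path to a pre-trained OpenNLP model and
;; returns a function that splits a string into a sequence of tokens.
;; The model path below is an assumption; point it at wherever you
;; saved en-token.bin.
(def tokenize (nlp/make-tokenizer "models/en-token.bin"))

(tokenize "Hello, world!")
;; => something like ["Hello" "," "world" "!"]

Note that punctuation comes back as separate tokens, which matches the description of tokenization above.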