Finding people, places, and things with Named Entity Recognition
One thing that's fairly easy to pull out of documents is named items, such as the names of people, organizations, and locations, as well as dates. The algorithms that find these items perform Named Entity Recognition (NER), and while they are not perfect, they're generally pretty good. Error rates under 0.1 (10 percent) are normal.
The OpenNLP library has classes to perform NER, and depending on the data their models were trained on, they will identify people, locations, dates, or a number of other things. The clojure-opennlp library exposes these classes in a Clojure-friendly way.
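As a rough preview of what this looks like, here is a minimal sketch. It assumes that clojure-opennlp is on the classpath and that pretrained OpenNLP models have been downloaded into a local models/ directory; the file paths and the names tokenize and find-people are illustrative choices, not fixed parts of the library.

```clojure
(require '[opennlp.nlp :as nlp])

;; Assumption: pretrained model files live in a local models/ directory.
;; Adjust these paths to wherever the models were actually saved.
(def tokenize (nlp/make-tokenizer "models/en-token.bin"))
(def find-people (nlp/make-name-finder "models/en-ner-person.bin"))

;; The name finder works on a sequence of tokens rather than a raw string,
;; so we tokenize the text first.
(find-people (tokenize "My name is Lee, not John."))
```

Swapping in a model trained on a different entity type, such as locations or dates, should give a finder for that type with the same calling pattern.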
Getting ready
We'll continue building on the previous recipes in this chapter. Because of this, we'll use the same project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])
From the Tokenizing text recipe, we'll...