Building and evaluating NER systems
Based on our discussion so far in this chapter, we know that building an NER system will start with the following steps:
1. Separate our document into sentences.
2. Separate our sentences into tokens.
3. Tag each token with a part of speech.
4. Identify named entities from this tagged token set.
5. Identify the class of each named entity.
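The five steps can be sketched end to end in a few lines of Python. This is only a toy illustration that uses the standard library: the regular expressions, the `FUNCTION_WORDS` set, the `GAZETTEER` lookup table, and all function names are hypothetical stand-ins invented for this example, not part of NLTK or any other library. The "tagger" in particular only marks capitalized words as proper nouns (`NNP`), which is far cruder than a real part-of-speech tagger.

```python
import re

# Hypothetical toy resources; a real system would learn these from labelled data.
FUNCTION_WORDS = {"the", "a", "an", "in", "on", "by", "it", "this"}
GAZETTEER = {"Ada Lovelace": "PERSON", "London": "GPE"}


def split_sentences(text):
    """Step 1: naive split after sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Step 2: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag(tokens):
    """Step 3: crude stand-in for a POS tagger (NNP = proper noun, X = other)."""
    return [
        (tok, "NNP" if tok[0].isupper() and tok.lower() not in FUNCTION_WORDS else "X")
        for tok in tokens
    ]


def find_entities(tagged):
    """Step 4: group runs of adjacent NNP tokens into candidate entities."""
    entities, current = [], []
    for tok, pos in tagged:
        if pos == "NNP":
            current.append(tok)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


def classify(entity):
    """Step 5: look the entity up; anything unseen stays UNKNOWN."""
    return GAZETTEER.get(entity, "UNKNOWN")


def extract_entities(text):
    """Run all five steps and return (entity, class) pairs in document order."""
    results = []
    for sentence in split_sentences(text):
        for entity in find_entities(tag(tokenize(sentence))):
            results.append((entity, classify(entity)))
    return results


if __name__ == "__main__":
    sample = "Ada Lovelace worked in London. The engine was built by Charles Babbage."
    print(extract_entities(sample))
```

On the sample text, "Charles Babbage" is found as an entity but classified as `UNKNOWN` because it is missing from the lookup table, which shows how quickly hand-written rules and lists run out. In NLTK, the corresponding steps are handled roughly by `sent_tokenize`, `word_tokenize`, `pos_tag`, and `ne_chunk`, with `ne_chunk` covering both entity detection and classification.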
A machine learning approach is commonly used to find tokens correctly at step 2, to separate the real named entities from the impostors at step 4, and to place each entity into the correct class at step 5, much as NLTK and its sentiment mining functions did for us in Chapter 5, Sentiment Analysis in Text. Relying on a large set of pre-classified examples will help us work out some of the more complicated issues in recognizing named entities that we introduced above: for example, choosing the correct boundary in multi-word noun phrases, or recognizing novel approaches to capitalization, or knowing what kind...