Why is NER difficult?
Like many NLP tasks, NER is not always simple. Although tokenizing a text reveals its components, understanding what those components are can be difficult. Relying on proper nouns will not always work because language is ambiguous. For example, Penny and Faith are valid names, but they may also refer to a unit of currency and a belief, respectively. Likewise, a word such as Georgia can be the name of a country, a state, or a person.
Multiword phrases can also be challenging. The phrase "Metropolitan Convention and Exhibit Hall" contains words that are valid entities in their own right. When the domain is well known, a list of entities can be very useful and is easy to implement.
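A list-based approach can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the GAZETTEER entries, their labels, and the find_entities helper are hypothetical and not taken from any library. The sketch tries the longest candidate phrase first, so a full name such as "Metropolitan Convention and Exhibit Hall" is preferred over the shorter entities it contains.

```python
import re

# Hypothetical entity list for illustration; a real one would come from the domain.
GAZETTEER = {
    ("metropolitan", "convention", "and", "exhibit", "hall"): "FACILITY",
    ("metropolitan",): "ORGANIZATION",
    ("georgia",): "LOCATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def find_entities(text):
    """Return (phrase, label) pairs, preferring the longest match at each position."""
    tokens = re.findall(r"\w+", text)
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first so a full phrase wins over its parts.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                match = (" ".join(tokens[i:i + n]), GAZETTEER[key])
                i += n  # skip past the matched phrase
                break
        if match:
            entities.append(match)
        else:
            i += 1
    return entities


print(find_entities("We met at the Metropolitan Convention and Exhibit Hall in Georgia."))
```

A lookup like this cannot resolve ambiguity on its own: it labels every occurrence of Georgia the same way, whether the word refers to a country, a state, or a person.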
NER is typically applied at the sentence level; otherwise, a phrase can easily span a sentence boundary, leading to the incorrect identification of an entity. For example, consider the following two sentences:
"Bob went south. Dakota went west."
If we ignored the sentence boundaries, then we could inadvertently find...