Extracting proper noun chunks
A simple way to do named entity extraction is to chunk all proper nouns (tagged with NNP). We can tag these chunks as NAME, since a proper noun is, by definition, the name of a person, place, or thing.
How to do it...
Using the RegexpParser class, we can create a very simple grammar that combines all proper nouns into a NAME chunk. Then, we can test this on the first tagged sentence of treebank_chunk to compare the results with the previous recipe:
>>> from nltk.chunk import RegexpParser
>>> from nltk.corpus import treebank_chunk
>>> chunker = RegexpParser(r'''
... NAME:
...   {<NNP>+}
... ''')
>>> sub_leaves(chunker.parse(treebank_chunk.tagged_sents()[0]), 'NAME')
[[('Pierre', 'NNP'), ('Vinken', 'NNP')], [('Nov.', 'NNP')]]
Although we get Nov. as a NAME chunk, this isn't a wrong result, as Nov. is the name of a month.
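To make the effect of the {<NNP>+} pattern concrete, the following is a minimal plain-Python sketch that groups runs of consecutive NNP tokens without using NLTK at all. It is not how RegexpParser is implemented; the proper_noun_runs function is a hypothetical helper introduced only for illustration, and the sentence is a hand-typed stand-in for the first tagged sentence of treebank_chunk.

```python
from itertools import groupby


def proper_noun_runs(tagged_sent, tag='NNP'):
    """Return each run of consecutive (word, tag) pairs whose tag equals `tag`.

    Hypothetical helper, not part of NLTK: it mimics what a {<NNP>+}
    chunk rule selects, namely one or more adjacent NNP tokens.
    """
    runs = []
    # groupby splits the sentence into alternating matching/non-matching runs
    for is_match, group in groupby(tagged_sent, key=lambda pair: pair[1] == tag):
        if is_match:
            runs.append(list(group))
    return runs


# Hand-typed stand-in for treebank_chunk.tagged_sents()[0] (an assumption)
sentence = [
    ('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
    ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'),
    ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'),
    ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'),
    ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.'),
]

names = proper_noun_runs(sentence)
print(names)
```

On this sentence, the sketch yields the same two groups as the chunker: the adjacent tokens Pierre and Vinken form one run, while Nov. forms a run of its own because the tokens around it are not tagged NNP.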
How it works...
The NAME chunker is a simple use of the RegexpParser class, covered in the Chunking and chinking with regular expressions, Merging and splitting chunks with regular expressions, and Partial...