Extracting named entities
Named entity recognition is a specific kind of chunk extraction that uses entity tags instead of, or in addition to, chunk tags. Common entity tags include PERSON, ORGANIZATION, and LOCATION. Part-of-speech tagged sentences are parsed into chunk trees as with normal chunking, but the labels of the trees can be entity tags instead of chunk phrase tags.
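To make the idea of entity-labeled chunk trees concrete, here is a minimal sketch that builds such a tree by hand and reads the entities back out. The sentence, the tree, and the extract_entities() helper are hypothetical and exist only for illustration; this is not the output of a trained chunker.

```python
from nltk.tree import Tree

# Hypothetical POS-tagged sentence, chosen only for illustration.
tagged = [("Pierre", "NNP"), ("Vinken", "NNP"), ("joined", "VBD"),
          ("Elsevier", "NNP"), ("in", "IN"), ("Amsterdam", "NNP")]

# Hand-built chunk tree: the subtree labels are entity tags rather than
# chunk phrase tags such as NP.
tree = Tree("S", [
    Tree("PERSON", tagged[0:2]),
    tagged[2],
    Tree("ORGANIZATION", [tagged[3]]),
    tagged[4],
    Tree("LOCATION", [tagged[5]]),
])

def extract_entities(tree):
    """Return (entity_tag, text) pairs for every subtree below the root."""
    return [
        (sub.label(), " ".join(word for word, _tag in sub.leaves()))
        for sub in tree.subtrees(lambda t: t is not tree)
    ]

entities = extract_entities(tree)
print(entities)
```

Words outside any entity subtree, such as "joined" and "in" here, stay as plain (word, tag) leaves directly under the root, just as unchunked words do in normal chunking.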
How to do it...
NLTK comes with a pre-trained named entity chunker. This chunker has been trained on data from the ACE program, a National Institute of Standards and Technology (NIST) sponsored program for Automatic Content Extraction, which you can read more about at http://www.itl.nist.gov/iad/894.01/tests/ace/. Unfortunately, this data is not included in the NLTK corpora, but the trained chunker is. This chunker can be used through the ne_chunk() function in the nltk.chunk module. The ne_chunk() function chunks a single tagged sentence into a Tree. The following is an example using ne_chunk() on the first tagged sentence...