Extracting location chunks
To identify LOCATION chunks, we can make a different kind of ChunkParserI subclass that uses the gazetteers corpus to identify location words. The gazetteers corpus is a WordListCorpusReader instance that contains the following location words:
Country names
U.S. states and abbreviations
Major U.S. cities
Canadian provinces
Mexican states
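A word-list corpus is essentially plain text with one entry per line. The following is a minimal sketch of that idea using a small in-memory sample rather than the real gazetteers files; the sample entries and the helper name load_word_list are assumptions made for illustration only.

```python
# Hypothetical stand-in for a gazetteer file: one location per line.
SAMPLE_WORD_LIST = """\
Canada
Mexico
New York
San Francisco
Texas
"""


def load_word_list(text):
    """Return the set of non-empty, stripped lines in text."""
    return {line.strip() for line in text.splitlines() if line.strip()}


locations = load_word_list(SAMPLE_WORD_LIST)

# A set gives fast membership tests, which matters when every word
# of every sentence is checked against the list.
print("Texas" in locations)     # True
print("New York" in locations)  # True
print("York" in locations)      # False: only the full name is an entry
```

Note that some entries span more than one word, so a chunker that only tests single tokens would miss them. With the NLTK data installed, set(gazetteers.words()) plays the role of locations here.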
How to do it...
The LocationChunker class, found in chunkers.py, iterates over a tagged sentence looking for words that are found in the gazetteers corpus. When it finds one or more location words, it creates a LOCATION chunk using IOB tags. The helper method iob_locations() is where the IOB LOCATION tags are produced, and the parse() method converts these IOB tags into a Tree:
from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers

class LocationChunker(ChunkParserI):
    def __init__(self):
        self.locations = set(gazetteers.words())
        self.lookahead = 0
        for loc in self.locations...
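The listing above breaks off, so the following is a separate, minimal sketch of the IOB-tagging idea and not the recipe's own implementation. It assumes a tiny hand-made location set, a greedy longest-match strategy, and a hypothetical function name iob_locations_sketch. It yields (word, pos, iob) triples, the shape that conlltags2tree accepts.

```python
# Hypothetical location set; the recipe builds its set from the gazetteers corpus.
LOCATIONS = {"San Francisco", "New York", "Texas", "Canada"}

# Length in words of the longest entry, used to bound the lookahead window.
MAX_WORDS = max(len(loc.split()) for loc in LOCATIONS)


def iob_locations_sketch(tagged_sent):
    """Yield (word, pos, iob) triples, marking gazetteer matches as LOCATION."""
    i = 0
    n = len(tagged_sent)
    while i < n:
        match_len = 0
        # Try the longest candidate phrase first, then shrink the window.
        for size in range(min(MAX_WORDS, n - i), 0, -1):
            phrase = " ".join(word for word, _ in tagged_sent[i:i + size])
            if phrase in LOCATIONS:
                match_len = size
                break
        if match_len == 0:
            # No location starts here: tag the token as outside any chunk.
            word, pos = tagged_sent[i]
            yield word, pos, "O"
            i += 1
            continue
        # First token of a match gets B-, the remaining tokens get I-.
        for offset in range(match_len):
            word, pos = tagged_sent[i + offset]
            prefix = "B" if offset == 0 else "I"
            yield word, pos, prefix + "-LOCATION"
        i += match_len


sentence = [("He", "PRP"), ("moved", "VBD"), ("from", "IN"),
            ("New", "NNP"), ("York", "NNP"), ("to", "TO"), ("Texas", "NNP")]
iob = list(iob_locations_sketch(sentence))
for triple in iob:
    print(triple)
```

Here "New York" is tagged B-LOCATION followed by I-LOCATION, "Texas" is tagged B-LOCATION, and every other token is tagged O. Passing a list like iob to conlltags2tree should then produce a Tree with LOCATION subtrees, which is the conversion the text attributes to parse().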