Classification-based chunking
Unlike most part-of-speech taggers, the ClassifierBasedTagger class learns from features. That means we can create a ClassifierChunker class that learns from both the words and the part-of-speech tags, instead of only the part-of-speech tags as the TagChunker class does.
How to do it...
For the ClassifierChunker class, we don't want to discard the words from the training sentences as we did in the previous recipe. Instead, to remain compatible with the 2-tuple (word, pos) format required for training a ClassifierBasedTagger class, we convert the (word, pos, iob) 3-tuples from tree2conlltags() into ((word, pos), iob) 2-tuples using the chunk_trees2train_chunks() function. This code can be found in chunkers.py:
from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import ClassifierBasedTagger

def chunk_trees2train_chunks(chunk_sents):
    tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
    return [[((w...
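The tuple reshaping described in this recipe can be illustrated in plain Python, without NLTK. This is only a sketch and not the code from chunkers.py: the helper name to_train_pairs() and the sample sentence are hypothetical, and the input is assumed to already be in the (word, pos, iob) form that tree2conlltags() produces.

```python
# Hypothetical helper (not from chunkers.py): reshape CoNLL-style
# (word, pos, iob) 3-tuples into ((word, pos), iob) 2-tuples, so that
# the (word, pos) pair plays the role of the "word" and the IOB tag
# plays the role of the "tag" when training a tagger.
def to_train_pairs(conll_sents):
    return [
        [((word, pos), iob) for (word, pos, iob) in sent]
        for sent in conll_sents
    ]

# Hand-written sample sentence, assumed to look like tree2conlltags() output.
conll_sents = [
    [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "O")],
]

train_pairs = to_train_pairs(conll_sents)
print(train_pairs[0][0])  # (('the', 'DT'), 'B-NP')
```

Each training item keeps the word alongside its part-of-speech tag, which is what allows a feature-based tagger to use both when predicting the IOB tag.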