Training a tagger-based chunker
Training a chunker can be a great alternative to manually specifying regular expression chunk patterns. Instead of a painstaking process of trial and error to get exactly the right patterns, we can use existing corpus data to train chunkers, much like we did for part-of-speech tagging in the previous chapter.
How to do it...
As with part-of-speech tagging, we'll use the treebank corpus data for training. But this time, we'll use the treebank_chunk corpus, which is specifically formatted to produce chunked sentences in the form of trees. The trees returned by its chunked_sents() method will be used by a TagChunker class to train a tagger-based chunker. The TagChunker class uses a helper function, conll_tag_chunks(), to extract a list of (pos, iob) tuples from a list of Trees. These (pos, iob) tuples are then used to train a tagger in the same way (word, pos) tuples were used in Chapter 4, Part-of-speech Tagging, to train part-of-speech taggers. But instead of learning part-of...
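As a rough illustration of the (pos, iob) extraction described above, the following sketch flattens a chunked sentence into the kind of training pairs a tagger-based chunker learns from. It is not the recipe's actual conll_tag_chunks() implementation. The Chunk class, tree_to_pos_iob(), and conll_tag_chunks_sketch() are hypothetical names, and Chunk is only a stand-in for an NLTK Tree so that the example runs without NLTK or any corpus data installed.

```python
# Hypothetical stand-in for a chunk subtree; in the real corpus an
# nltk.Tree plays this role. It holds a label and (word, pos) leaves.
class Chunk(list):
    def __init__(self, label, leaves):
        super().__init__(leaves)
        self.label = label


def tree_to_pos_iob(sentence):
    """Flatten one chunked sentence into a list of (pos, iob) tuples."""
    pairs = []
    for node in sentence:
        if isinstance(node, Chunk):
            # The first token of a chunk gets B-<label>; the rest get I-<label>.
            for i, (word, pos) in enumerate(node):
                prefix = "B" if i == 0 else "I"
                pairs.append((pos, f"{prefix}-{node.label}"))
        else:
            # A bare (word, pos) tuple sits outside any chunk.
            word, pos = node
            pairs.append((pos, "O"))
    return pairs


def conll_tag_chunks_sketch(chunk_sents):
    """Apply the flattening to every sentence (assumed helper, not the book's)."""
    return [tree_to_pos_iob(sent) for sent in chunk_sents]


# One hand-built chunked sentence: [NP the book] has [NP many chapters]
sentence = [
    Chunk("NP", [("the", "DT"), ("book", "NN")]),
    ("has", "VBZ"),
    Chunk("NP", [("many", "JJ"), ("chapters", "NNS")]),
]

training_data = conll_tag_chunks_sketch([sentence])
print(training_data[0])
# [('DT', 'B-NP'), ('NN', 'I-NP'), ('VBZ', 'O'), ('JJ', 'B-NP'), ('NNS', 'I-NP')]
```

Each pair has the same shape as a (word, pos) training example, with the part-of-speech tag standing in for the word and the IOB tag standing in for the part-of-speech tag. With NLTK installed, nltk.chunk.util.tree2conlltags() yields (word, pos, iob) triples from a real Tree, from which the word can simply be dropped.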