Merging and splitting chunks with regular expressions
In this recipe, we'll cover two more rules for chunking. A MergeRule class can merge two chunks together based on the end of the first chunk and the beginning of the second chunk. A SplitRule class will split a chunk into two chunks based on the specified split pattern.
How to do it...
A SplitRule class is specified with two opposing curly braces, }{, with a pattern on either side. To split a chunk after a noun, you would use <NN.*>}{<.*>. A MergeRule class is specified by flipping the curly braces to {}, and will join chunks where the end of the first chunk matches the left pattern and the beginning of the next chunk matches the right pattern. To merge two chunks where the first ends with a noun and the second begins with a noun, you'd use <NN.*>{}<NN.*>.
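As a minimal sketch of how these patterns fit into a grammar, the following example passes a split rule and two merge rules to RegexpParser after an initial chunk rule. The sentence, its part-of-speech tags, the NP label, and the first chunk rule are assumptions chosen for illustration. The sentence is tagged by hand, so no tagger or corpus data is needed.

```python
from nltk.chunk import RegexpParser

# Grammar for illustration: one broad chunk rule, then a split rule,
# then two merge rules. Rules are applied in the order listed.
chunker = RegexpParser(r'''
NP:
    {<DT><.*>*<NN.*>}
    <NN.*>}{<.*>
    <.*>{}<NN.*>
    <NN.*>{}<NN.*>
''')

# A hand-tagged sentence (assumed tags, no tagger required).
sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'),
        ('filling', 'VBG'), ('and', 'CC'), ('the', 'DT'),
        ('customer', 'NN'), ('tipped', 'VBD'), ('well', 'RB'), ('.', '.')]

tree = chunker.parse(sent)

# Collect the words of each NP chunk for easy inspection.
chunks = [[word for word, tag in subtree.leaves()]
          for subtree in tree.subtrees(lambda t: t.label() == 'NP')]
print(chunks)
# [['the', 'sushi', 'roll'], ['was', 'filling', 'and', 'the', 'customer']]
```

The chunk rule first grabs everything from the first determiner to the last noun. The split rule then breaks that chunk after each noun that is followed by another tag, and the first merge rule rejoins "the sushi" with "roll" because the second chunk begins with a noun. The last merge rule finds nothing left to merge in this sentence.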
Note
The order of rules is very important, and reordering them can change the results. The RegexpParser class applies the rules one at a time from top to bottom...
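To see the effect of rule order, here is a small sketch that parses the same sentence with two grammars that differ only in whether the split rule comes before or after the merge rule. The sentence, the tags, the grammars, and the np_chunks helper are all hypothetical and exist only for this illustration.

```python
from nltk.chunk import RegexpParser

# A hand-tagged sentence (assumed tags, no tagger required).
sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'),
        ('filling', 'VBG'), ('and', 'CC'), ('the', 'DT'),
        ('customer', 'NN'), ('tipped', 'VBD'), ('well', 'RB'), ('.', '.')]


def np_chunks(grammar):
    """Hypothetical helper: parse sent and return the words in each NP chunk."""
    tree = RegexpParser(grammar).parse(sent)
    return [[word for word, tag in subtree.leaves()]
            for subtree in tree.subtrees(lambda t: t.label() == 'NP')]


# Split first, then merge: the merge rule can rejoin "the sushi" and "roll".
split_then_merge = np_chunks(r'''
NP:
    {<DT><.*>*<NN.*>}
    <NN.*>}{<.*>
    <.*>{}<NN.*>
''')

# Merge first, then split: there is only one chunk when the merge rule
# runs, so it does nothing, and the later split is never undone.
merge_then_split = np_chunks(r'''
NP:
    {<DT><.*>*<NN.*>}
    <.*>{}<NN.*>
    <NN.*>}{<.*>
''')

print(split_then_merge)
# [['the', 'sushi', 'roll'], ['was', 'filling', 'and', 'the', 'customer']]
print(merge_then_split)
# [['the', 'sushi'], ['roll'], ['was', 'filling', 'and', 'the', 'customer']]
```

Both grammars contain exactly the same rules, yet the second one leaves "roll" in a chunk of its own, because a merge rule can only join chunks that already exist when it is applied.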