Converting a chunk tree to text
At some point, you may want to convert a Tree or subtree back to a sentence or chunk string. This is mostly straightforward, except when it comes to properly outputting punctuation.
How to do it...
We'll use the first tree of the treebank_chunk corpus as our example. The obvious first step is to join all the words in the tree with a space:
>>> from nltk.corpus import treebank_chunk
>>> tree = treebank_chunk.chunked_sents()[0]
>>> ' '.join([w for w, t in tree.leaves()])
'Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .'
As you can see, the punctuation isn't quite right. The commas and the period are treated as individual words, so each one gets a space in front of it. We can fix this using regular expression substitution. This is implemented in the chunk_tree_to_sent() function found in transforms.py:
import re
punct_re = re.compile(r'\s([,\.;\?])')
def chunk_tree_to_sent(tree, concat=' ')...
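To make the substitution step concrete, here is a minimal sketch that applies the same pattern to a plain list of words. It is not the body of chunk_tree_to_sent(); the helper name join_words() is hypothetical, and the word list is typed in by hand from the output shown earlier so that the example runs without NLTK or its corpus data.

```python
import re

# Same pattern as in the listing: a single whitespace character followed by
# a comma, period, semicolon, or question mark (captured as group 1).
punct_re = re.compile(r'\s([,\.;\?])')

def join_words(words, concat=' '):
    """Hypothetical helper: join words, then drop the space before punctuation."""
    text = concat.join(words)
    # Replace "<space><punct>" with just "<punct>".
    return punct_re.sub(r'\g<1>', text)

# Words copied by hand from the joined output of the first treebank_chunk tree.
words = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join',
         'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

sentence = join_words(words)
print(sentence)
```

The abbreviation Nov. is left alone because its period follows a letter, not whitespace. Only the characters in the character class are handled, so other punctuation such as colons or exclamation marks would keep a leading space.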