Merging and splitting tokens
We extracted named entities in the previous section, but what if we want to merge or split multiword named entities? And what if the tokenizer performed poorly on some unusual tokens and we want to split them by hand? In this subsection, we'll cover a very practical remedy for multiword expressions, multiword named entities, and typos.
doc.retokenize is the correct tool for merging and splitting spans. Let's see an example of retokenization by merging a multiword named entity, as follows:
```python
doc = nlp("She lived in New Hampshire.")
doc.ents
(New Hampshire,)
[(token.text, token.i) for token in doc]
[('She', 0), ('lived', 1), ('in', 2), ('New', 3), ('Hampshire', 4), ('.', 5)]
len(doc)
6
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new hampshire"})
```
Here, we merged the two tokens of the span doc[3:5] (New Hampshire) into a single token and assigned it a custom lemma via the attrs parameter.
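The retokenizer handles splitting as well. Below is a minimal sketch of retokenizer.split, using a blank English pipeline (so no trained model needs to be downloaded) and a hypothetical run-together token, NewHampshire, invented for illustration. The heads argument tells spaCy how to reattach the new tokens in the dependency tree: here, New attaches to the second subtoken (Hampshire), and Hampshire attaches to the preposition in.

```python
import spacy

# A blank pipeline is enough to demonstrate splitting; no model required.
nlp = spacy.blank("en")
doc = nlp("She lived in NewHampshire.")
print([token.text for token in doc])
# ['She', 'lived', 'in', 'NewHampshire', '.']

with doc.retokenize() as retokenizer:
    # Split token 3 into two tokens. The orths must join back into the
    # original token text. heads: (doc[3], 1) means "attach to the second
    # new subtoken"; doc[2] means "attach to the token 'in'".
    retokenizer.split(doc[3], ["New", "Hampshire"],
                      heads=[(doc[3], 1), doc[2]])

print([token.text for token in doc])
# ['She', 'lived', 'in', 'New', 'Hampshire', '.']
```

Note that the strings passed as orths must concatenate to exactly the original token's text, otherwise spaCy raises an error.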