Coding with spaCy
Because the text preprocessing steps described earlier are fundamental to NLP, the NLP community has long wanted an open source library that makes them available to more researchers. spaCy was developed and open sourced to meet that need. It is designed specifically for production use, so researchers can build applications that process massive volumes of text efficiently. Its NLP pipeline runs all the assigned NLP tasks and then stores the results as attributes of each token.
Figure 3.1 shows how the nlp() pipeline of spaCy works. It takes the raw text, tokenizes the text with its tokenizer, tags each tokenized word with its tagger, and so on. The results are stored as attributes:
Figure 3.1 – The spaCy pipeline
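To make the flow in the figure concrete, here is a minimal pure-Python sketch of the same idea: a tokenizer turns raw text into a document object, and each later component writes its result onto the tokens as an attribute. This is not spaCy's implementation. The names Token, Doc, TOY_TAGS, and this nlp() function are hypothetical stand-ins invented for illustration, and the tagger is a tiny hand-written lookup rather than a trained model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class Token:
    text: str
    pos: Optional[str] = None  # filled in later by the tagger step


@dataclass
class Doc:
    text: str
    tokens: List[Token] = field(default_factory=list)


def tokenizer(text: str) -> Doc:
    """Turn the raw string into a Doc of tokens (naive whitespace split)."""
    return Doc(text=text, tokens=[Token(word) for word in text.split()])


# Hypothetical lookup table standing in for a trained statistical tagger.
TOY_TAGS = {"the": "DET", "cat": "NOUN", "sleeps": "VERB"}


def tagger(doc: Doc) -> Doc:
    """Store a part-of-speech tag on each token as an attribute."""
    for token in doc.tokens:
        token.pos = TOY_TAGS.get(token.text.lower(), "X")  # "X" = unknown
    return doc


def nlp(text: str, components: Sequence[Callable[[Doc], Doc]] = (tagger,)) -> Doc:
    """Tokenize first, then pass the same Doc through each component in order."""
    doc = tokenizer(text)
    for component in components:
        doc = component(doc)
    return doc


doc = nlp("The cat sleeps")
for token in doc.tokens:
    print(token.text, token.pos)
```

Each component receives the document, annotates it, and hands it on, so adding another step (a parser, for example) only means appending another function to the sequence. spaCy's own pipeline follows the same pattern, with trained components in place of these toy ones.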
Let’s see what the pipeline components are:
tokenizer: This tokenizes the text and turns a string of text into an NLP object.
tagger and parser: These assign part-of-speech (PoS) tags and dependency labels. The PoS...