Getting started with data preparation
In the previous chapters, we saw how to make the best of spaCy's pre-trained statistical models (including the POS tagger, NER, and dependency parser) in our applications. In this chapter, we will see how to customize these statistical models for our own domain and data.
spaCy models perform well on general NLP tasks, such as understanding a sentence's syntax, splitting a paragraph into sentences, and extracting entities. However, we sometimes work in very specific domains whose text spaCy models didn't see during training.
For example, Twitter text contains many nonstandard tokens, such as hashtags, emoticons, and mentions. Also, tweets are often just phrases rather than full sentences. Here, it's entirely reasonable to expect spaCy's POS tagger to perform poorly, because the tagger was trained on full, grammatically correct English sentences.
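To get a feel for how far such text drifts from ordinary prose, we can roughly measure the share of nonstandard tokens in a tweet. The following is only a minimal sketch using plain Python: the example tweet is made up, the regular expressions are simple heuristics of our own (they are not spaCy's tokenizer rules), and `classify` and `nonstandard_ratio` are hypothetical helper names introduced for this illustration.

```python
import re

# Hypothetical example tweet, made up for illustration.
tweet = "omg @anna best concert ever :) #livemusic #nofilter"

# Rough heuristic patterns (our own assumptions, not spaCy's tokenizer rules).
PATTERNS = {
    "hashtag": re.compile(r"^#\w+$"),
    "mention": re.compile(r"^@\w+$"),
    "emoticon": re.compile(r"^[:;=][-']?[()DPp/\\|]$"),
}

def classify(token):
    """Return the first matching category, or 'word' if none match."""
    for label, pattern in PATTERNS.items():
        if pattern.match(token):
            return label
    return "word"

def nonstandard_ratio(text):
    """Fraction of whitespace-separated tokens that are not plain words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    flagged = [t for t in tokens if classify(t) != "word"]
    return len(flagged) / len(tokens)

# Label every token, then summarize the tweet with a single number.
labels = [(tok, classify(tok)) for tok in tweet.split()]
ratio = nonstandard_ratio(tweet)
print(labels)
print(f"nonstandard share: {ratio:.2f}")
```

In this made-up tweet, four of the eight whitespace-separated tokens are a mention, an emoticon, or a hashtag. A high share like this on our own data is a hint, not a proof, that a tagger trained on edited prose may struggle with it.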
Another example is the medical domain...