Additional NLP and network considerations
This has been a marathon of a chapter. Please bear with me a little longer. I have a few final thoughts that I’d like to express, and then we can conclude this chapter.
Data cleanup
First, if you work with language data, there will always be cleanup. Language is messy and difficult. If you are only comfortable working with pre-cleaned tabular data, this is going to feel chaotic. I love that, as every project gives me a chance to improve my techniques and tactics.
I showed two different approaches for extracting entities: PoS tagging and NER. Both approaches work very well, but consider which one gets us to a clean and useful entity list most quickly and easily. With PoS tagging, we get one token at a time. With NER, we very quickly get to entities, but the models occasionally misbehave or don’t catch everything, so there is always some cleanup with this approach as well.
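To make the contrast concrete, here is a minimal sketch of both approaches using spaCy. The library choice, the model name (en_core_web_sm), and the example sentence are illustrative assumptions on my part, not tied to the earlier examples; any tagger and NER pipeline would show the same trade-off.

```python
# Minimal sketch: PoS tagging vs. NER for entity extraction (assumes spaCy
# and its small English model "en_core_web_sm" are installed).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mark Twain wrote The Adventures of Tom Sawyer in 1876.")

# PoS tagging: one token at a time, so multi-word names arrive in pieces
# and have to be reassembled (and cleaned) by hand.
proper_nouns = [token.text for token in doc if token.pos_ == "PROPN"]
print(proper_nouns)   # e.g. ['Mark', 'Twain', 'Tom', 'Sawyer']

# NER: multi-token entities come out whole, but the model can mislabel
# or miss spans, so a cleanup pass is still needed afterward.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)       # e.g. [('Mark Twain', 'PERSON'), ('1876', 'DATE'), ...]
```

Either way, the output is a starting point for an entity list, not the finished list itself.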
There is no silver bullet. I want to use whatever...