Chapter 10. Working with Unstructured and Textual Data
In this chapter, we will cover the following recipes:
- Tokenizing text
- Finding sentences
- Focusing on content words with stoplists
- Getting document frequencies
- Scaling document frequencies by document size
- Scaling document frequencies with TF-IDF
- Finding people, places, and things with Named Entity Recognition
- Mapping documents to a sparse vector space representation
- Performing topic modeling with MALLET
- Performing naïve Bayesian classification with MALLET