Chapter 7. Text Classification
In this chapter, we will cover the following recipes:
- Bag of words feature extraction
- Training a Naive Bayes classifier
- Training a decision tree classifier
- Training a maximum entropy classifier
- Training scikit-learn classifiers
- Measuring precision and recall of a classifier
- Calculating high information words
- Combining classifiers with voting
- Classifying with multiple binary classifiers
- Training a classifier with NLTK-Trainer
Introduction
Text classification is a way to categorize documents or pieces of text. By examining the word usage in a piece of text, classifiers can decide what class label to assign to it. A binary classifier decides between two labels, such as positive or negative: the text gets one label or the other, but not both. A multi-label classifier, by contrast, can assign one or more labels to a piece of text.
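To make the difference between binary and multi-label classification concrete, here is a minimal sketch in plain Python. It is not how the classifiers in this chapter are trained. The keyword lists and the function names (bag_of_words, classify_binary, classify_multi_label) are hypothetical, chosen only to show that a binary classifier returns exactly one of two labels, while a multi-label classifier returns a set of labels.

```python
# Toy illustration only: the word lists and function names are hypothetical
# and are not part of NLTK or of the recipes in this chapter.

POSITIVE_WORDS = {"great", "good", "love"}
NEGATIVE_WORDS = {"bad", "awful", "hate"}

TOPIC_WORDS = {
    "sports": {"game", "team", "score"},
    "finance": {"stock", "market", "price"},
}


def bag_of_words(text):
    """Map each lowercase, whitespace-separated word to True."""
    return {word: True for word in text.lower().split()}


def classify_binary(features):
    """Return exactly one of two labels: 'pos' or 'neg'."""
    pos_hits = sum(1 for word in features if word in POSITIVE_WORDS)
    neg_hits = sum(1 for word in features if word in NEGATIVE_WORDS)
    # Ties fall back to 'neg' so that a single label is always returned.
    return "pos" if pos_hits > neg_hits else "neg"


def classify_multi_label(features):
    """Return every topic label that has at least one keyword in the features.

    Unlike a real multi-label classifier, this toy version returns an
    empty set when no keyword matches.
    """
    return {
        label
        for label, words in TOPIC_WORDS.items()
        if any(word in features for word in words)
    }


features = bag_of_words("Great game but the stock price fell")
print(classify_binary(features))               # pos
print(sorted(classify_multi_label(features)))  # ['finance', 'sports']
```

The same piece of text receives a single sentiment label from the binary classifier, but two topic labels from the multi-label one. Real classifiers learn which words matter from training data instead of relying on hand-written keyword lists.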
Classification works by learning from labeled feature sets, or training data, to later classify an unlabeled feature set. A labeled...