Calculating high information words
A high information word is a word that is strongly biased towards a single classification label. These are the kinds of words we saw when we called the show_most_informative_features() method on both the NaiveBayesClassifier class and the MaxentClassifier class. Somewhat surprisingly, the top words differ between the two classifiers. This discrepancy comes from how each classifier calculates the significance of each feature, and having these different methods is actually beneficial, because they can be combined to improve accuracy, as we will see in the next recipe, Combining classifiers with voting.
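To make the idea of a word being "biased towards a single label" concrete, here is a minimal sketch in plain Python. It is not NLTK code and not necessarily how this recipe scores words: it assumes a deliberately simple measure, the share of a word's occurrences that fall under its most frequent label. The function names label_bias_scores and high_information_words, the min_share and min_count parameters, and the toy samples are all hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def label_bias_scores(labeled_words, min_count=2):
    """Map each word to (most frequent label, share of occurrences under it).

    labeled_words is an iterable of (label, list_of_words) pairs.
    Words seen fewer than min_count times are skipped, since a single
    occurrence always looks perfectly biased.
    """
    word_label_counts = defaultdict(Counter)
    for label, words in labeled_words:
        for word in words:
            word_label_counts[word][label] += 1

    scores = {}
    for word, counts in word_label_counts.items():
        total = sum(counts.values())
        if total < min_count:
            continue
        best_label, best_count = counts.most_common(1)[0]
        scores[word] = (best_label, best_count / total)
    return scores


def high_information_words(labeled_words, min_share=0.9, min_count=2):
    """Return the words whose occurrences are concentrated in one label."""
    scores = label_bias_scores(labeled_words, min_count)
    return {word for word, (_, share) in scores.items() if share >= min_share}


# Toy data (invented for this sketch): two labels, a few short documents.
samples = [
    ("pos", ["great", "movie", "great", "fun", "the"]),
    ("neg", ["awful", "movie", "boring", "awful", "the"]),
    ("pos", ["fun", "the", "movie"]),
    ("neg", ["boring", "the", "movie"]),
]

print(sorted(high_information_words(samples)))
```

In this toy data, "movie" and "the" appear equally often under both labels, so their share is 0.5 and they fall below the threshold, while words such as "great" and "awful" occur under one label only. A real scorer would likely use a statistical significance measure rather than a raw share, but the shape of the computation (count per word and label, score, threshold) stays the same.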
Low information words are words that are common to all labels. It may be counter-intuitive, but eliminating these words from the training data can actually improve accuracy, precision, and recall. This works because using only high information words reduces the noise and confusion in a classifier's internal model. If all the words/features are highly...