Introduction
Text classification is a way to categorize documents or pieces of text. By examining the word usage in a piece of text, classifiers can decide what class label to assign to it. A binary classifier decides between two labels, such as positive or negative: the text gets one label or the other, but not both. A multi-label classifier, by contrast, can assign one or more labels to a piece of text.
Classification works by learning from labeled feature sets, or training data, to later classify an unlabeled feature set. A labeled feature set is simply a tuple that looks like (feat, label), while an unlabeled feature set is a feat by itself. A feature set is basically a key-value mapping of feature names to feature values. In the case of text classification, the feature names are usually words, and the values are all True. As the documents may have unknown words, and the number of possible words may be very large, words that don't occur in the text are omitted, instead of including...
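The feature sets described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name bag_of_words, the sample sentence, and the label "pos" are hypothetical and do not come from the text.

```python
def bag_of_words(words):
    """Build a feature set: each word that occurs in the text maps to True.

    Words that do not occur are simply left out of the mapping.
    (Hypothetical helper, named here for illustration only.)
    """
    return {word: True for word in words}


# An unlabeled feature set is the mapping by itself.
feat = bag_of_words("the quick brown fox".split())

# A labeled feature set is a (feat, label) tuple; "pos" is an assumed label.
labeled = (feat, "pos")

print(feat)
print("fox" in feat, "dog" in feat)
```

Because absent words are omitted rather than stored, the mapping only grows with the words actually present in a given text, and a membership test is enough to tell whether a word occurred.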