So far, the data we have dealt with has been either tabular data, with columns as features, or image data, with pixels as features. In the case of text, things are less obvious. Should we use sentences, words, or characters as our features? Sentences are very specific. For example, the exact same sentence is very unlikely to appear in two or more Wikipedia articles. Therefore, if we use sentences as features, we will end up with a huge number of features that do not generalize well.
Characters, on the other hand, are limited in number. For example, there are only 26 letters in the English alphabet. With so little variety, an individual character is unlikely to carry enough information for the downstream algorithms to exploit. As a result, words are typically used as features for most tasks.
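To make this trade-off concrete, here is a minimal sketch that counts the distinct features produced at each granularity. The two-document corpus is made up for illustration, and the helper names (`sentences`, `words`, `characters`, and `feature_stats`) are hypothetical. The sentence splitter is deliberately naive and only splits on periods.

```python
import re

# Toy corpus, invented for illustration only.
docs = [
    "The cat sat on the mat. The dog barked.",
    "The dog sat on the rug. The cat purred.",
]

def sentences(text):
    # Naive split on periods; real sentence segmentation is harder.
    return [s.strip().lower() for s in text.split(".") if s.strip()]

def words(text):
    # Lowercase runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())

def characters(text):
    # Letters only; whitespace and punctuation are ignored.
    return [c for c in text.lower() if c.isalpha()]

def feature_stats(corpus, tokenizer):
    """Return (number of distinct features, number shared by all documents)."""
    per_doc = [set(tokenizer(doc)) for doc in corpus]
    return len(set.union(*per_doc)), len(set.intersection(*per_doc))

for name, tokenizer in [("sentences", sentences),
                        ("words", words),
                        ("characters", characters)]:
    total, shared = feature_stats(docs, tokenizer)
    print(f"{name:<10} distinct={total:<3} shared={shared}")
```

On this toy corpus, no sentence appears in both documents, most letters appear in both, and words fall in between. A corpus this small only hints at the pattern, which becomes far more pronounced on real text.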
Later in this chapter, we will see that fairly...