There are several ways to categorize parts of text. At the character level, for example, we may be concerned with punctuation and with whether to ignore or expand contractions. At the word level, we may need to perform operations such as the following:
- Identifying morphemes using stemming and/or lemmatization
- Expanding abbreviations and acronyms
- Isolating number units
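Two of these word-level operations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ABBREVIATIONS` table, the `NUMBER_UNIT` pattern, and both function names are hypothetical and not taken from any particular library. Stemming and lemmatization are left out because they normally rely on a dedicated library.

```python
import re

# Hypothetical lookup table; a real system would load a much larger one.
ABBREVIATIONS = {
    "approx.": "approximately",
    "NLP": "natural language processing",
}

# Assumed pattern: an integer or decimal number, optional whitespace,
# then an alphabetic unit such as "m" or "g".
NUMBER_UNIT = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)")


def expand_abbreviations(words):
    """Replace each known abbreviation with its expansion."""
    return [ABBREVIATIONS.get(word, word) for word in words]


def isolate_number_units(text):
    """Return (number, unit) pairs found in the text."""
    return NUMBER_UNIT.findall(text)


if __name__ == "__main__":
    print(expand_abbreviations(["NLP", "is", "fun"]))
    print(isolate_number_units("The cable is 2.5m long and weighs 300 g."))
```

A dictionary lookup is the simplest possible approach to abbreviation expansion. It cannot resolve abbreviations that have more than one meaning, which requires looking at the surrounding context.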
We cannot always split words at punctuation, because the punctuation is sometimes part of the word, as in can't. We may also want to group multiple words into meaningful phrases. Sentence detection can also be a factor, because we do not necessarily want to group words that cross sentence boundaries.
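The following Python sketch illustrates both points: it keeps an internal apostrophe as part of a word, and it forms word pairs only within a sentence. It is a minimal example, not a definitive tokenizer. The pattern `TOKEN_PATTERN` and the three function names are assumptions made for illustration, and the sentence splitter is deliberately naive (it would break on abbreviations such as "Dr.").

```python
import re

# Assumed token pattern: a run of letters or digits that may contain
# internal apostrophes (can't), or a single punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)*|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def split_sentences(text):
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def bigrams_within_sentences(text):
    """Pair adjacent words, never pairing across a sentence boundary."""
    pairs = []
    for sentence in split_sentences(text):
        # Keep word tokens only; punctuation tokens are dropped here.
        words = [t for t in tokenize(sentence) if t[0].isalnum()]
        pairs.extend(zip(words, words[1:]))
    return pairs


if __name__ == "__main__":
    print(tokenize("We can't stop."))
    print(bigrams_within_sentences("I can't go. You can."))
```

Because the pairs are built one sentence at a time, the last word of one sentence is never grouped with the first word of the next.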
In this chapter, we are primarily concerned with the tokenization process and a few specialized techniques, such...