Understanding the parts of text
There are a number of ways of categorizing parts of text. For example, we may be concerned with character-level issues such as punctuations with a possible need to ignore or expand contractions. At the word level, we may need to perform different operations such as:
Identifying morphemes using stemming and/or lemmatization
Expanding abbreviations and acronyms
Isolating number units
We cannot always split words with punctuations because the punctuations are sometimes considered to be part of the word, such as the word "can't". We may also be concerned with grouping multiple words to form meaningful phrases. Sentence detection can also be a factor. We do not necessarily want to group words that cross sentence boundaries.
In this chapter, we are primarily concerned with the tokenization process and a few specialized techniques such as stemming. We will not attempt to show how they are used in other NLP tasks. Those efforts are reserved for later chapters.