One of the first steps in Natural Language Processing (NLP) is extracting tokens from text. Tokenization splits text into tokens, which are typically words. Normally, text is split on delimiters such as white space, which includes blanks, tabs, carriage returns, and line feeds. However, specialized tokenizers can split text according to other delimiters. In this chapter, we will illustrate several tokenizers that you will find useful in your analysis.
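To make the idea of delimiter-based splitting concrete, the following is a minimal sketch that uses only the standard Java library. The class name `WhitespaceTokenizerSketch` and the sample sentence are illustrative assumptions, not code from the recipes in this chapter.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceTokenizerSketch {

    // Splits text on runs of white space (blanks, tabs, carriage returns, line feeds).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            // An empty or all-white-space input yields a single empty piece; skip it.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Illustrative sample text containing a blank, a tab, and a CR/LF pair.
        String text = "The quick\tbrown fox\r\njumped over the lazy dog.";
        System.out.println(tokenize(text));
    }
}
```

Note that the final token in this sample is `dog.` with the period still attached. Handling punctuation is one reason a more specialized tokenizer is often preferable to a plain white space split.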
Another important NLP task involves determining the stem and lexical meaning of a word. This is useful for deriving more meaning about the words being processed, as illustrated in the fifth and sixth recipes. The stem of a word refers to the root of a word. For example, the stem of the word "antiquated" is "antiqu". While this may not seem to be the correct stem, a stem is the base that remains after affixes are removed, and it does not have to be a valid word in its own right.
The lexical meaning of a word is not concerned with the context in which it is being used. We will be examining the process of lemmatizing a word. Lemmatization is also concerned with finding the root of a word, but it uses a more detailed dictionary to do so. The stem of a word may vary depending on the form the word takes, whereas lemmatization maps every form of a word to the same root. Stemming is often used when a less precise determination of the root is acceptable. A more thorough discussion of stemming versus lemmatization can be found at: https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/.
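The difference between the two approaches can be illustrated with a toy sketch. The suffix list and the miniature dictionary below are hypothetical and exist only for illustration. A real stemmer, such as a Porter stemmer, applies far richer rules, and a real lemmatizer relies on a full dictionary or a trained model.

```java
import java.util.Locale;
import java.util.Map;

class StemVersusLemmaSketch {

    // Hypothetical suffix list, checked in order; not a real stemming algorithm.
    private static final String[] SUFFIXES = {"ing", "es", "ed", "s"};

    // Hypothetical miniature dictionary mapping word forms to a lemma.
    private static final Map<String, String> LEMMAS = Map.of(
            "studies", "study",
            "studying", "study",
            "studied", "study");

    // Strips the first matching suffix, leaving at least three characters.
    static String naiveStem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (lower.endsWith(suffix) && lower.length() > suffix.length() + 2) {
                return lower.substring(0, lower.length() - suffix.length());
            }
        }
        return lower;
    }

    // Looks the word up in the dictionary; unknown words are returned unchanged.
    static String lemma(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(lower, lower);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"studies", "studying", "studied"}) {
            System.out.println(word + " -> stem: " + naiveStem(word)
                    + ", lemma: " + lemma(word));
        }
    }
}
```

With this sketch, "studies" and "studied" stem to "studi" while "studying" stems to "study", whereas the dictionary lookup returns "study" for all three forms.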
The last task in this chapter deals with the process of text normalization. Here, we are concerned with converting extracted tokens to a form that can be more easily processed during later analysis. Typical normalization activities include converting case, expanding abbreviations, removing stop words, stemming, and lemmatization. Stop words are words that can often be ignored in certain types of analysis. For example, in some contexts, the word "the" does not need to be included.
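As a rough illustration of two of these normalization steps, the sketch below converts tokens to lowercase and drops stop words using only the standard Java library. The stop word list is a small hypothetical sample chosen for this example. It does not reflect the list used by LingPipe or any other library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class NormalizationSketch {

    // Hypothetical, deliberately tiny stop word list for illustration only.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "and");

    // Lowercases each token and keeps it only if it is not a stop word.
    static List<String> normalize(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                result.add(lower);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalize(List.of("The", "Stem", "of", "a", "Word")));
    }
}
```

Lowercasing before the stop word check means that "The" and "the" are treated alike. Whether that is desirable depends on the analysis, since case can carry meaning for proper nouns.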
In this chapter, we will cover the following recipes:
- Tokenization using the Java SDK
- Tokenization using OpenNLP
- Tokenization using maximum entropy
- Training a neural network tokenizer for specialized text
- Identifying the stem of a word
- Training an OpenNLP lemmatization model
- Determining the lexical meaning of a word using OpenNLP
- Removing stop words using LingPipe