
You're reading from Natural Language Processing with Java Cookbook Over 70 recipes to create linguistic and language translation applications using Java libraries

Product type Paperback

Published in Apr 2019

Publisher Packt

ISBN-13 9781789801156

Length 386 pages

Edition 1st Edition

Languages

Java

Tools

Deeplearning4j

Concepts

Mobile Application Development

Authors (2):

Richard M. Reese


One of the first steps in Natural Language Processing (NLP) is extracting tokens from text. Tokenization splits text into tokens, which are usually words. Tokens are normally separated by delimiters such as white space, which includes blanks, tabs, carriage returns, and line feeds. Specialized tokenizers, however, can split text on other delimiters. In this chapter, we will illustrate several tokenizers that you will find useful in your analysis.
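As a minimal sketch of the delimiter-based splitting described above, the following uses only standard Java classes (`String.split` and `java.util.StringTokenizer`). The class name, method names, and sample text are hypothetical and chosen for illustration; this is not the code used in the recipes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch only; class and method names are hypothetical.
final class WhitespaceTokenizerSketch {

    // Splits on runs of white space (blanks, tabs, carriage returns, line feeds).
    static List<String> splitOnWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Splits on a caller-supplied set of single-character delimiters.
    static List<String> splitOnDelimiters(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Let's pause,\tand then\r\nreflect.";
        // White space only: punctuation stays attached to the words.
        System.out.println(splitOnWhitespace(text));
        // White space plus comma and period: punctuation is dropped.
        System.out.println(splitOnDelimiters(text, " \t\r\n,."));
    }
}
```

Splitting on white space alone leaves punctuation attached to tokens such as `pause,`, which is one reason specialized tokenizers are often preferred for real text.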

Another important NLP task involves determining the stem and lexical meaning of a word. This is useful for deriving more meaning about the words being processed, as illustrated in the fifth and sixth recipes. The stem of a word refers to the root of the word. For example, the stem of the word antiquated is antiqu. While antiqu is not itself a word and so may not seem to be the correct stem, the stem of a word is the ultimate base of the word.

The lexical meaning of a word is not concerned with the context in which the word is used. We will examine the process of lemmatizing a word. Lemmatization is also concerned with finding the root of a word, but it uses a more detailed dictionary to do so. The stem of a word may vary depending on the form the word takes, whereas with lemmatization the root will always be the same. Stemming is often used when a less precise determination of the root is acceptable. A more thorough discussion of stemming versus lemmatization can be found at https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/.
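The contrast between stemming and lemmatization can be sketched with a deliberately simplistic example. The suffix list and the lemma dictionary below are hypothetical and tiny. A real stemmer (such as the Porter algorithm) applies many more rules, and a real lemmatizer relies on a full dictionary or a trained model, so treat this only as an illustration of the idea.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy illustration only; not the Porter algorithm and not OpenNLP's lemmatizer.
final class StemVersusLemmaSketch {

    // Hypothetical suffixes, checked in this order.
    private static final String[] SUFFIXES = {"ated", "ing", "ed", "s"};

    // Hypothetical hand-made dictionary mapping word forms to a lemma.
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("running", "run");
        LEMMAS.put("runs", "run");
        LEMMAS.put("ran", "run");
    }

    // Strips the first matching suffix, keeping at least three characters.
    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int baseLength = lower.length() - suffix.length();
            if (lower.endsWith(suffix) && baseLength >= 3) {
                return lower.substring(0, baseLength);
            }
        }
        return lower;
    }

    // Looks the word up in the dictionary; unknown words are returned unchanged.
    static String lemma(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(lower, lower);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"antiquated", "running", "runs", "ran"}) {
            System.out.println(word + " stem=" + stem(word) + " lemma=" + lemma(word));
        }
    }
}
```

With these toy rules, the stems of running, runs, and ran are runn, run, and ran, while the dictionary lookup returns run for all three. The stem of antiquated comes out as antiqu, matching the example given earlier.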

The last task in this chapter deals with text normalization. Here, we are concerned with converting extracted tokens to a form that can be more easily processed during later analysis. Typical normalization activities include converting case, expanding abbreviations, removing stop words, stemming, and lemmatization. Stop words are words that can often be ignored in certain types of analysis. For example, in some contexts, the word "the" does not need to be included.
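A few of the normalization activities described above (case conversion, abbreviation expansion, and stop word removal) can be sketched in plain Java as follows. The stop word set and abbreviation table are hypothetical and far smaller than anything used in practice, and the class and method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; the word lists are hypothetical and incomplete.
final class NormalizationSketch {

    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    private static final Map<String, String> ABBREVIATIONS = new HashMap<>();
    static {
        ABBREVIATIONS.put("dr.", "doctor");
        ABBREVIATIONS.put("st.", "street");
    }

    // Lower-cases each token, expands known abbreviations, and drops stop words.
    static List<String> normalize(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            String expanded = ABBREVIATIONS.getOrDefault(lower, lower);
            if (!STOP_WORDS.contains(expanded)) {
                result.add(expanded);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Dr.", "of", "Java");
        System.out.println(normalize(tokens));
    }
}
```

Lower-casing before the lookups means the stop word and abbreviation tables only need lower-case entries. Stemming or lemmatization would typically follow as a later step on the remaining tokens.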

In this chapter, we will cover the following recipes:

Tokenization using the Java SDK
Tokenization using OpenNLP
Tokenization using maximum entropy
Training a neural network tokenizer for specialized text
Identifying the stem of a word
Training an OpenNLP lemmatization model
Determining the lexical meaning of a word using OpenNLP
Removing stop words using LingPipe


Preparing Text for Analysis and Tokenization
