Natural Language Processing with Java Cookbook: Over 70 recipes to create linguistic and language translation applications using Java libraries
Richard M. Reese

Preparing Text for Analysis and Tokenization

One of the first steps required for Natural Language Processing (NLP) is the extraction of tokens from text. The process of tokenization splits text into tokens—that is, words. Normally, tokens are split based upon delimiters, such as white space. White space includes blanks, tabs, and carriage-return line feeds. However, specialized tokenizers can split text according to other delimiters. In this chapter, we will illustrate several tokenizers that you will find useful in your analysis.
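
Before turning to the recipes, a minimal sketch of delimiter-based splitting may be helpful. It is not part of the recipes that follow; it uses only the Java class library's String.split method with a regular expression that matches runs of white space:

// Minimal delimiter-based tokenization using only the Java class library
String text = "The cat\tsat on\nthe mat.";
String[] tokens = text.split("\\s+"); // split on runs of white space
for (String token : tokens) {
    System.out.println(token);
}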

Another important NLP task involves determining the stem and lexical meaning of a word. This is useful for deriving more meaning about the words being processed, as illustrated in the fifth and sixth recipes. The stem of a word is its root. For example, the stem of the word antiquated is antiqu. While this may not look like a proper word, the stem is the ultimate base form of the word.

The lexical meaning of a word is not concerned with the context in which it is being used. We will be examining the process of performing lemmatization of a word. This is also concerned with finding the root of a word, but it uses a more detailed dictionary to find that root. The stem of a word may vary depending on the form the word takes; with lemmatization, the root will always be the same. Stemming is often used when a less precise determination of the root is acceptable. A more thorough discussion of stemming versus lemmatization can be found at https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/.

The last task in this chapter deals with the process of text normalization. Here, we are concerned with converting the extracted tokens to a form that can be more easily processed during later analysis. Typical normalization activities include converting cases, expanding abbreviations, removing stop words, stemming, and lemmatization. Stop words are words that can often be ignored in certain types of analysis. For example, in some contexts, the word the does not always need to be included.

In this chapter, we will cover the following recipes:

  • Tokenization using the Java SDK
  • Tokenization using OpenNLP
  • Tokenization using maximum entropy
  • Training a neural network tokenizer for specialized text
  • Identifying the stem of a word
  • Training an OpenNLP lemmatization model
  • Determining the lexical meaning of a word using OpenNLP
  • Removing stop words using LingPipe

Technical requirements

Tokenization using the Java SDK

Tokenization can be achieved using a number of Java classes, including the String, StringTokenizer, and StreamTokenizer classes. In this recipe, we will demonstrate the use of the Scanner class. While frequently used for console input, it can also be used to tokenize a string.

Getting ready

To prepare, we need to create a new Java project.

How to do it...

Let's go through the following steps:

  1. Add the following import statements to your project's class:
import java.util.ArrayList;
import java.util.Scanner;
  2. Add the following statements to the main method to declare the sample string, create an instance of the Scanner class, and add a list to hold the tokens:
String sampleText =
    "In addition, the rook was moved too far to be effective.";
Scanner scanner = new Scanner(sampleText);
ArrayList<String> list = new ArrayList<>();
  3. Insert the following loops to populate the list and display the tokens:
while (scanner.hasNext()) {
    String token = scanner.next();
    list.add(token);
}

for (String token : list) {
    System.out.println(token);
}
  4. Execute the program. You should get the following output:
In
addition,
the
rook
was
moved
too
far
to
be
effective.

How it works...

The Scanner class's constructor took a string as an argument, which allowed us to apply the Scanner class's methods against that text. We used the next method, which returns a single token at a time, delimited by white space. While it was not necessary to store the tokens in a list, doing so permits us to use them later for other purposes.
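
The Scanner class is not limited to white space: its useDelimiter method accepts a regular expression that replaces the default delimiter. The following sketch, which is not part of the recipe, splits a comma-separated string while discarding the surrounding spaces:

// Tokenizing comma-separated values with a custom delimiter pattern
Scanner csvScanner = new Scanner("rook, knight, bishop");
csvScanner.useDelimiter("\\s*,\\s*");
while (csvScanner.hasNext()) {
    System.out.println(csvScanner.next()); // prints rook, knight, bishop on separate lines
}
csvScanner.close();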

Tokenization using OpenNLP

In this recipe, we will create an instance of the OpenNLP SimpleTokenizer class to illustrate tokenization. We will use its tokenize method against a sample text.

Getting ready

To prepare, we need to do the following:

  1. Create a new Java project
  2. Add the following POM dependency to your project:
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.0</version>
</dependency>

How to do it...

Let's go through the following steps:

  1. Start by adding the following import statement to your project's class:
import opennlp.tools.tokenize.SimpleTokenizer;
  2. Next, add the following main method to your project:
public static void main(String[] args) {
    String sampleText =
        "In addition, the rook was moved too far to be effective.";
    SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
    String[] tokenList = simpleTokenizer.tokenize(sampleText);
    for (String token : tokenList) {
        System.out.println(token);
    }
}

After executing the program, you should get the following output:

In
addition
,
the
rook
was
moved
too
far
to
be
effective
.

How it works...

The SimpleTokenizer class represents a tokenizer that splits text based on character classes, which is why punctuation such as commas and periods ends up in separate tokens. The shared instance is accessed through the class's INSTANCE field. With this tokenizer, we pass a single string to its tokenize method, which returns an array of strings, as shown in the following code:

String sampleText =
    "In addition, the rook was moved too far to be effective.";
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
String[] tokenList = simpleTokenizer.tokenize(sampleText);

We then iterated through the list of tokens and displayed one per line. Note how the tokenizer treats the comma and the period as tokens.
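
For contrast, OpenNLP also provides a WhitespaceTokenizer, exposed through an INSTANCE field in the same way, which splits only on white space and therefore leaves punctuation attached to adjacent words. A brief sketch, assuming an import of opennlp.tools.tokenize.WhitespaceTokenizer:

// WhitespaceTokenizer keeps "addition," and "effective." as single tokens
String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(
    "In addition, the rook was moved too far to be effective.");
for (String token : tokens) {
    System.out.println(token);
}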

See also

Tokenization using maximum entropy

Maximum entropy is a statistical classification technique. It takes various characteristics of a subject, such as the use of specialized words or the presence of whiskers in a picture, and assigns a weight to each characteristic. These weights are eventually added up and normalized to a value between 0 and 1, indicating the probability that the subject is of a particular kind. With a high enough level of confidence, we can conclude that the text is all about high-energy physics or that we have a picture of a cat.

If you're interested, you can find a more complete explanation of this technique at https://nadesnotes.wordpress.com/2016/09/05/natural-language-processing-nlp-fundamentals-maximum-entropy-maxent/. In this recipe, we will demonstrate the use of maximum entropy with the OpenNLP TokenizerME class.
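
To make the weighting idea concrete, here is a toy sketch; it is not how OpenNLP implements maximum entropy, and the weights are invented for illustration. Each outcome's feature weights are summed, exponentiated, and normalized so the scores become probabilities between 0 and 1:

// Invented weights for two outcomes over the same three active features
double[] catWeights = { 1.2, 0.4, -0.3 };
double[] dogWeights = { -0.5, 0.1, 0.9 };

double catScore = Math.exp(catWeights[0] + catWeights[1] + catWeights[2]);
double dogScore = Math.exp(dogWeights[0] + dogWeights[1] + dogWeights[2]);
double total = catScore + dogScore;

System.out.printf("P(cat) = %.3f%n", catScore / total); // approximately 0.690
System.out.printf("P(dog) = %.3f%n", dogScore / total); // approximately 0.310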

Getting ready

To prepare, we need to do the following:

  1. Create a new Maven project.
  2. Download the en-token.bin file from http://opennlp.sourceforge.net/models-1.5/. Save it at the root directory of the project.
  3. Add the following POM dependency to your project:
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.0</version>
</dependency>

How to do it...

Let's go through the following steps:

  1. Add the following imports to the project:
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
  2. Next, add the following code to the main method. This sequence initializes the text to be processed and creates an input stream to read in the tokenization model. Modify the first argument of the File constructor to reflect the path to the model file:
String sampleText =
    "In addition, the rook was moved too far to be effective.";
try (InputStream modelInputStream = new FileInputStream(
        new File("...", "en-token.bin"))) {
    ...
} catch (FileNotFoundException e) {
    // Handle exception
} catch (IOException e) {
    // Handle exception
}
  3. Add the following code to the try block. It creates a tokenizer model and then the actual tokenizer:
TokenizerModel tokenizerModel =
    new TokenizerModel(modelInputStream);
Tokenizer tokenizer = new TokenizerME(tokenizerModel);
  4. Insert the following code sequence that uses the tokenize method to create a list of tokens and then displays them:
String[] tokenList = tokenizer.tokenize(sampleText);
for (String token : tokenList) {
    System.out.println(token);
}
  5. Next, execute the program. You should get the following output:
In
addition
,
the
rook
was
moved
too
far
to
be
effective
.

How it works...

The sampleText variable holds the test string. A try-with-resources block is used to automatically close the InputStream. The FileInputStream constructor throws a FileNotFoundException, while the TokenizerModel constructor throws an IOException, both of which need to be handled.

The code examples in this book that deal with exception handling include a comment suggesting that exceptions should be handled. The user is encouraged to add the appropriate code to deal with exceptions. This will often include print statements or possibly logging operations.

An instance of the TokenizerModel class is created using the en-token.bin model. This model has been trained to recognize English text. An instance of the TokenizerME class represents the tokenizer where the tokenize method is executed against it using the sample text. This method returns an array of strings that are then displayed. Note that the comma and period are treated as separate tokens.
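
Because TokenizerME is a probabilistic tokenizer, you can also inspect how confident it was about each token. The following is a sketch, assuming the tokenizer and sample text from this recipe plus an additional import of opennlp.tools.util.Span: the tokenizePos method returns the character spans of the tokens, and getTokenProbabilities returns the probabilities associated with the most recent call:

// Spans give character offsets; probabilities refer to the last tokenization
TokenizerME tokenizerME = new TokenizerME(tokenizerModel);
Span[] spans = tokenizerME.tokenizePos(sampleText);
double[] probabilities = tokenizerME.getTokenProbabilities();
for (int i = 0; i < spans.length; i++) {
    System.out.println(spans[i].getCoveredText(sampleText)
        + " " + probabilities[i]);
}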

See also

Training a neural network tokenizer for specialized text

Sometimes, we need to work with specialized text, such as an uncommon language or text that is unique to a problem domain. In such cases, the standard tokenizers are not always sufficient. This necessitates the creation of a unique model that will work better with the specialized text. In this recipe, we will demonstrate how to train a model using OpenNLP.

Getting ready

To prepare, we need to do the following:

  1. Create a new Maven project
  2. Add the following dependency to the POM file:
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.0</version>
</dependency>

How to do it...

Let's go through the following steps:

  1. Create a file called training-data.train. Add the following to the file:
The first sentence is terminated by a period<SPLIT>. We will want to be able to identify tokens that are separated by something other than whitespace<SPLIT>. This can include commas<SPLIT>, numbers such as 100.204<SPLIT>, and other punctuation characters including colons:<SPLIT>.
  2. Next, add the following imports to the program:
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
  3. Next, add the following code to the project's main method to obtain the training data. Adjust the path to match the location of your training-data.train file:
InputStreamFactory inputStreamFactory = new InputStreamFactory() {
    public InputStream createInputStream()
            throws FileNotFoundException {
        return new FileInputStream(
            "C:/NLP Cookbook/Code/chapter2a/training-data.train");
    }
};
  4. Insert the following try-with-resources block, which trains the model and then saves it:
try (
    ObjectStream<String> stringObjectStream =
        new PlainTextByLineStream(inputStreamFactory, "UTF-8");
    ObjectStream<TokenSample> tokenSampleStream =
        new TokenSampleStream(stringObjectStream)) {

    TokenizerModel tokenizerModel = TokenizerME.train(
        tokenSampleStream, new TokenizerFactory(
            "en", null, true, null),
        TrainingParameters.defaultParams());
    BufferedOutputStream modelOutputStream =
        new BufferedOutputStream(new FileOutputStream(
            new File(
                "C:/NLP Cookbook/Code/chapter2a/mymodel.bin")));
    tokenizerModel.serialize(modelOutputStream);
} catch (IOException ex) {
    // Handle exception
}
  5. To test the new model, we will reuse the code found in the Tokenization using OpenNLP recipe. Add the following code after the preceding try block. Note that the model is loaded from the location where it was saved:
String sampleText =
    "In addition, the rook was moved too far to be effective.";
try (InputStream modelInputStream = new FileInputStream(
        new File("C:/NLP Cookbook/Code/chapter2a", "mymodel.bin"))) {
    TokenizerModel tokenizerModel =
        new TokenizerModel(modelInputStream);
    Tokenizer tokenizer = new TokenizerME(tokenizerModel);
    String[] tokenList = tokenizer.tokenize(sampleText);
    for (String token : tokenList) {
        System.out.println(token);
    }
} catch (FileNotFoundException e) {
    // Handle exception
} catch (IOException e) {
    // Handle exception
}
  6. When executing the program, you will get output similar to the following. Some of the training output has been removed to save space:
Indexing events with TwoPass using cutoff of 5

Computing event counts... done. 36 events
Indexing... done.
Sorting and merging events... done. Reduced 36 events to 12.
Done indexing in 0.21 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 12
Number of Outcomes: 2
Number of Predicates: 9
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-24.95329850015802 0.8611111111111112
2: ... loglikelihood=-14.200654164477221 0.8611111111111112
3: ... loglikelihood=-11.526745527757855 0.8611111111111112
4: ... loglikelihood=-9.984657035211438 0.8888888888888888
...
97: ... loglikelihood=-0.7805227945549726 1.0
98: ... loglikelihood=-0.7730211829010772 1.0
99: ... loglikelihood=-0.765664507836384 1.0
100: ... loglikelihood=-0.7584485899716518 1.0
In
addition
,
the
rook
was
moved
too
far
to
be
effective
.

How it works...

To understand how this all works, we will explain the training code, the testing code, and the output. We will start with the training code.

To create a model, we need the training data that was saved in the training-data.train file. Its contents are as follows:

The first sentence is terminated by a period<SPLIT>. We will want to be able to identify tokens that are separated by something other than whitespace<SPLIT>. This can include commas<SPLIT>, numbers such as 100.204<SPLIT>, and other punctuation characters including colons:<SPLIT>.

The <SPLIT> markup has been added just before the places where the tokenizer should split tokens at locations other than white space. Normally, we would use a larger set of data to obtain a better model. For our purposes, this file will work.

We created an instance of the InputStreamFactory to represent the training data file, as shown in the following code:

InputStreamFactory inputStreamFactory = new InputStreamFactory() {
    public InputStream createInputStream()
            throws FileNotFoundException {
        return new FileInputStream("training-data.train");
    }
};

An object stream is created in the try block that reads from the file. The PlainTextByLineStream class processes plain text line by line. This stream was then used to create another stream of TokenSample objects, providing a form usable for training the model, as shown in the following code:

try (
    ObjectStream<String> stringObjectStream =
        new PlainTextByLineStream(inputStreamFactory, "UTF-8");
    ObjectStream<TokenSample> tokenSampleStream =
        new TokenSampleStream(stringObjectStream)) {
    ...
} catch (IOException ex) {
    // Handle exception
}

The train method performed the training. It takes the token stream, a TokenizerFactory instance, and a set of training parameters. The TokenizerFactory instance provides the basic tokenizer. Its arguments include the language used and other factors, such as an abbreviation dictionary. In this example, English is the language, and the other arguments are not used. We used the default set of training parameters, as shown in the following code:

TokenizerModel tokenizerModel = TokenizerME.train(
    tokenSampleStream, new TokenizerFactory("en", null, true, null),
    TrainingParameters.defaultParams());
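
As a variation, the second argument of the TokenizerFactory constructor accepts an abbreviation dictionary, which the tokenizer can use to handle known abbreviations and their trailing periods specially. The following sketch is not part of the original recipe; it assumes imports for opennlp.tools.dictionary.Dictionary and opennlp.tools.util.StringList:

// Training with a small, illustrative abbreviation dictionary
Dictionary abbreviations = new Dictionary();
abbreviations.put(new StringList("Mr."));
abbreviations.put(new StringList("Dr."));
TokenizerModel modelWithAbbreviations = TokenizerME.train(
    tokenSampleStream,
    new TokenizerFactory("en", abbreviations, true, null),
    TrainingParameters.defaultParams());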

Once the model was trained, we saved it to the mymodel.bin file using the serialize method:

BufferedOutputStream modelOutputStream = new BufferedOutputStream(
    new FileOutputStream(new File("mymodel.bin")));
tokenizerModel.serialize(modelOutputStream);

To test the model, we reused the tokenization code found in the Tokenization using OpenNLP recipe. You can refer to that recipe for an explanation of the code.

The output of the preceding code displays various statistics, such as the number of passes and iterations performed. One token was displayed per line, as shown in the following output. Note that the comma and period are treated as separate tokens with this model:

In
addition
,
the
rook
was
moved
too
far
to
be
effective
.

There's more...

See also

Identifying the stem of a word

Finding the stem of a word is easy to do. We will illustrate this process using OpenNLP’s PorterStemmer class.

Getting ready

To prepare, we need to do the following:

  1. Create a new Maven project
  2. Add the following dependency to the POM file:
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.0</version>
</dependency>

How to do it...

Let's go through the following steps:

  1. Add the following import statement to the program:
import opennlp.tools.stemmer.PorterStemmer;
  2. Then, add the following code to the main method:
String[] wordList =
    { "draft", "drafted", "drafting", "drafts",
      "drafty", "draftsman" };
PorterStemmer porterStemmer = new PorterStemmer();
for (String word : wordList) {
    String stem = porterStemmer.stem(word);
    System.out.println("The stem of " + word + " is " + stem);
}
  3. Execute the program. The output should be as follows:
The stem of draft is draft
The stem of drafted is draft
The stem of drafting is draft
The stem of drafts is draft
The stem of drafty is drafti
The stem of draftsman is draftsman

How it works...

We start by creating an array of strings that will hold words that we will use with the stemmer:

String[] wordList =
    { "draft", "drafted", "drafting", "drafts", "drafty", "draftsman" };

The OpenNLP PorterStemmer class supports finding the stem of a word. It has a single default constructor, which is the only constructor available for this class. It is used to create an instance of the class, as shown in the following code:

PorterStemmer porterStemmer = new PorterStemmer();

The remainder of the code iterates over the array and invokes the stem method against each word in the array, as shown in the following code:

for (String word : wordList) {
    String stem = porterStemmer.stem(word);
    System.out.println("The stem of " + word + " is " + stem);
}
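
OpenNLP also ships a SnowballStemmer, which supports several languages and can produce different stems than the Porter implementation. A brief sketch, not part of the original recipe, assuming an import of opennlp.tools.stemmer.snowball.SnowballStemmer:

// The Snowball English stemmer as an alternative to PorterStemmer
SnowballStemmer snowballStemmer =
    new SnowballStemmer(SnowballStemmer.ALGORITHM.ENGLISH);
for (String word : new String[] { "drafted", "drafty", "draftsman" }) {
    System.out.println(word + " -> " + snowballStemmer.stem(word));
}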

See also

Training an OpenNLP lemmatization model

We will train a model using OpenNLP, which can be used to perform lemmatization. The actual process of performing lemmatization is illustrated in the following recipe, Determining the lexical meaning of a word using OpenNLP.

Getting ready

The most straightforward technique to train a model is to use the OpenNLP command-line tools. Download these tools from the OpenNLP page at https://opennlp.apache.org/download.html. We will not need the source code for these tools, so download the file named apache-opennlp-1.9.0-bin.tar.gz. Selecting that file will take you to a page that lists mirror sites for the file. Choose one that will work best for your location.

Once the file has been saved, expand the file. This will extract a .tar file. Next, expand this file, which will create a directory called apache-opennlp-1.9.0. In its bin subdirectory, you will find the tools that we need.

We will need training data for the training process. We will use the en-lemmatizer.dict file found at https://raw.githubusercontent.com/richardwilly98/elasticsearch-opennlp-auto-tagging/master/src/main/resources/models/en-lemmatizer.dict. Use a browser to open the page, and then save it with the file name en-lemmatizer.dict.

How to do it...

Let's go through the following steps:

  1. Open a command-line window. We used the Windows cmd program in this example.
  2. Add the OpenNLP tools' bin directory to your path and then navigate to the directory containing the en-lemmatizer.dict file.
  3. Execute the following command:
opennlp LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data en-lemmatizer.dict -encoding UTF-8

You will get the following output. It has been shortened here to save space:

Indexing events with TwoPass using cutoff of 5
Computing event counts... done. 301403 events
Indexing... done.

Sorting and merging events... done. Reduced 301403 events to 297777.
Done indexing in 9.09 s.

Incorporating indexed data for training...
done.
Number of Event Tokens: 297777
Number of Outcomes: 432
Number of Predicates: 69122
...done.

Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-1829041.6775780176 3.317817009120679E-6
2: ... loglikelihood=-452333.43760414346 0.876829361353404
3: ... loglikelihood=-211099.05280473927 0.9506806501594212
4: ... loglikelihood=-132195.3981804198 0.9667554735686108
...
98: ... loglikelihood=-6702.5821153954375 0.9988420818638168
99: ... loglikelihood=-6652.6134177562335 0.998845399680826
100: ... loglikelihood=-6603.518040975329 0.9988553531318534

Writing lemmatizer model
... done (1.274s)
Wrote lemmatizer model to
path: C:\Downloads\OpenNLP\en-lemmatizer.bin

Execution time: 275.369 seconds

How it works...

To understand the output, we need to explain the following command:

opennlp LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data en-lemmatizer.dict -encoding UTF-8

The opennlp command is used with a number of OpenNLP tools. The tool to be used is specified by the command's first argument. In this example, we used the LemmatizerTrainerME tool. The arguments that follow control how the training process works. The LemmatizerTrainerME arguments are documented at https://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.cli.lemmatizer.LemmatizerTrainerME.

We use the -model, -lang, -data, and -encoding arguments, as detailed in the following list:

  • The -model argument specifies the name of the model output file. This is the file that holds the trained model that we will use in the next recipe.
  • The -lang argument specifies the natural language used. In this case, we use en, which indicates the training data is English.
  • The -data argument specifies the file containing the training data. We used the en-lemmatizer.dict file.
  • The -encoding parameter specifies the character set used by the training data. We used UTF-8, which indicates the data is Unicode data.
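
The command-line tool is a thin wrapper over the OpenNLP API, so the same training can be done programmatically. The following is a sketch under the assumption that OpenNLP 1.9's LemmatizerME.train and LemmaSampleStream behave as the CLI does with default parameters; it is not taken from the original text:

// Requires these imports:
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.lemmatizer.LemmaSample;
import opennlp.tools.lemmatizer.LemmaSampleStream;
import opennlp.tools.lemmatizer.LemmatizerFactory;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

InputStreamFactory factory = new InputStreamFactory() {
    public InputStream createInputStream() throws IOException {
        return new FileInputStream("en-lemmatizer.dict");
    }
};
try (ObjectStream<String> lines =
         new PlainTextByLineStream(factory, "UTF-8");
     ObjectStream<LemmaSample> samples = new LemmaSampleStream(lines)) {
    // Mirrors: opennlp LemmatizerTrainerME -lang en -encoding UTF-8
    LemmatizerModel model = LemmatizerME.train("en", samples,
        TrainingParameters.defaultParams(), new LemmatizerFactory());
    try (BufferedOutputStream out = new BufferedOutputStream(
             new FileOutputStream("en-lemmatizer.bin"))) {
        model.serialize(out);
    }
} catch (IOException e) {
    // Handle exception
}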

The output shows the training process. It displays various statistics, such as the number of passes and iterations performed. During each iteration, the probability increases, as shown in the following output. By the 100th iteration, the probability approaches 1.0.

Performing 100 iterations:

1: ... loglikelihood=-1829041.6775780176 3.317817009120679E-6
2: ... loglikelihood=-452333.43760414346 0.876829361353404
3: ... loglikelihood=-211099.05280473927 0.9506806501594212
4: ... loglikelihood=-132195.3981804198 0.9667554735686108
...
98: ... loglikelihood=-6702.5821153954375 0.9988420818638168
99: ... loglikelihood=-6652.6134177562335 0.998845399680826
100: ... loglikelihood=-6603.518040975329 0.9988553531318534
Writing lemmatizer model ... done (1.274s)

The final part of the output shows where the file is written. We wrote the lemmatizer model to the path C:\Downloads\OpenNLP\en-lemmatizer.bin.

There's more...

If you have specialized lemmatization needs, then you will need to create a training file. The training data file consists of a series of lines. Each line consists of three entries separated by spaces. The first entry contains a word. The second entry is the POS tag for the word. The third entry is the lemma for the word.

For example, in en-lemmatizer.dict, there are several lines for variations of the word bump, as shown in the following code:

bump    NN   bump
bump    VB   bump
bump    VBP  bump
bumped  VBD  bump
bumped  VBN  bump
bumper  JJ   bumper
bumper  NN   bumper

As you can see, a word may be used in different contexts and with different suffixes. Other datasets can be used for training. These include the Penn Treebank (https://web.archive.org/web/19970614160127/http://www.cis.upenn.edu/~treebank/) and the CoNLL 2009 datasets (https://www.ldc.upenn.edu/).
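
If you build your own training file, it is worth verifying the three-column layout before training. The following sketch is not from the original text; it assumes the file sits in the working directory, requires imports for java.nio.file.Files and java.nio.file.Paths, and must run where an IOException can be handled:

// Report lines that do not match the expected word/POS-tag/lemma layout
Files.lines(Paths.get("en-lemmatizer.dict"))
    .filter(line -> line.trim().split("\\s+").length != 3)
    .forEach(line -> System.out.println("Malformed: " + line));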

Training parameters other than the default parameters can be specified depending on the needs of the problem.

In the next recipe, Determining the lexical meaning of a word using OpenNLP, we will use the model to determine the lexical meaning of a word.

See also

Determining the lexical meaning of a word using OpenNLP

In this recipe, we will use the model we created in the previous recipe to perform lemmatization. We will perform lemmatization on the following sentence:

The girls were leaving the clubhouse for another adventurous afternoon.

In the example, the lemmas for each word in the sentence will be displayed.

Getting ready

To prepare, we need to do the following:

  1. Create a new Maven project
  2. Add the following dependency to the POM file:
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.0</version>
</dependency>

How to do it...

Let's go through the following steps:

  1. Add the following imports to the project:
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
  2. Add the following try block to the main method. An input stream and model are created, followed by the instantiation of the lemmatizer:
LemmatizerModel lemmatizerModel = null;
try (InputStream modelInputStream = new FileInputStream(
        "C:\\Downloads\\OpenNLP\\en-lemmatizer.bin")) {
    lemmatizerModel = new LemmatizerModel(modelInputStream);
    LemmatizerME lemmatizer = new LemmatizerME(lemmatizerModel);

} catch (FileNotFoundException e) {
    // Handle exception
} catch (IOException e) {
    // Handle exception
}

  3. Add the following code to the end of the try block. It sets up arrays holding the words of the sample text and their POS tags. It then performs the lemmatization and displays the results:
String[] tokens = new String[] {
    "The", "girls", "were", "leaving", "the",
    "clubhouse", "for", "another", "adventurous",
    "afternoon", "." };
String[] posTags = new String[] { "DT", "NNS", "VBD",
    "VBG", "DT", "NN", "IN", "DT", "JJ", "NN", "." };
String[] lemmas = lemmatizer.lemmatize(tokens, posTags);
for (int i = 0; i < tokens.length; i++) {
    System.out.println(tokens[i] + " - " + lemmas[i]);
}
  4. Upon executing the program, you will get the following output that displays each word and then its lemma:
The - the
girls - girl
were - be
leaving - leave
the - the
clubhouse - clubhouse
for - for
another - another
adventurous - adventurous
afternoon - afternoon
. - .

How it works...

We performed lemmatization on the sentence The girls were leaving the clubhouse for another adventurous afternoon. A LemmatizerModel was declared and instantiated from the en-lemmatizer.bin file. A try-with-resources block was used to obtain an input stream for the file, as shown in the following code:

LemmatizerModel lemmatizerModel = null;
try (InputStream modelInputStream = new FileInputStream(
        "C:\\Downloads\\OpenNLP\\en-lemmatizer.bin")) {
    lemmatizerModel = new LemmatizerModel(modelInputStream);

Next, the lemmatizer was created using the LemmatizerME class, as shown in the following code:

LemmatizerME lemmatizer = new LemmatizerME(lemmatizerModel);

The following sentence was processed, and is represented as an array of strings. We also need an array of POS tags for the lemmatization process to work. This array was defined in parallel with the sentence array. As we will see in Chapter 4, Detecting POS Using Neural Networks, there are often alternative tags that are possible for a sentence. For this example, we used tags generated by the Cognitive Computation Group's online tool at http://cogcomp.org/page/demo_view/pos:

String[] tokens = new String[] {
    "The", "girls", "were", "leaving", "the",
    "clubhouse", "for", "another", "adventurous",
    "afternoon", "." };
String[] posTags = new String[] { "DT", "NNS", "VBD",
    "VBG", "DT", "NN", "IN", "DT", "JJ", "NN", "." };

The lemmatization then occurred, where the lemmatize method uses the two arrays to build an array of lemmas for each word in the sentence, as shown in the following code:

String[] lemmas = lemmatizer.lemmatize(tokens, posTags);

The lemmas are then displayed, as shown in the following code:

for (int i = 0; i < tokens.length; i++) {
    System.out.println(tokens[i] + " - " + lemmas[i]);
}
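
The POS tags in this recipe were hand-coded. As a sketch of an alternative, not part of the original recipe, OpenNLP's POSTaggerME can generate the tags instead. This assumes the en-pos-maxent.bin model has been downloaded from http://opennlp.sourceforge.net/models-1.5/ and that imports for opennlp.tools.postag.POSModel and opennlp.tools.postag.POSTaggerME have been added:

try (InputStream posModelStream =
         new FileInputStream("en-pos-maxent.bin")) {
    POSModel posModel = new POSModel(posModelStream);
    POSTaggerME tagger = new POSTaggerME(posModel);
    // One tag per token, in the same order as the tokens array
    String[] generatedTags = tagger.tag(tokens);
    String[] lemmas = lemmatizer.lemmatize(tokens, generatedTags);
    for (int i = 0; i < tokens.length; i++) {
        System.out.println(tokens[i] + " - " + lemmas[i]);
    }
} catch (IOException e) {
    // Handle exception
}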

See also

Removing stop words using LingPipe

Normalization is the process of preparing text for subsequent analysis. This is frequently performed once the text has been tokenized. Normalization activities include such tasks as converting the text to lowercase, validating data, inserting missing elements, stemming, lemmatization, and removing stop words.

We have already examined the stemming and lemmatization process in earlier recipes. In this recipe, we will show how stop words can be removed. Stop words are those words that are not always useful. For example, some downstream NLP tasks do not need to have words such as a, the, or and. These types of words are the common words found in a language. Analysis can often be enhanced by removing them from a text.

Getting ready

To prepare, we need to do the following:

  1. Create a new Maven project
  2. Add the following dependency to the POM file:
<dependency>
    <groupId>de.julielab</groupId>
    <artifactId>aliasi-lingpipe</artifactId>
    <version>4.1.0</version>
</dependency>

How to do it...

Let's go through the following steps:

  1. Add the following import statements to your program:
import com.aliasi.tokenizer.EnglishStopTokenizerFactory;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;
  2. Add the following code to the main method:
String sentence =
    "The blue goose and a quiet lamb stopped to smell the roses.";
TokenizerFactory tokenizerFactory =
    IndoEuropeanTokenizerFactory.INSTANCE;
tokenizerFactory =
    new EnglishStopTokenizerFactory(tokenizerFactory);
Tokenizer tokenizer = tokenizerFactory.tokenizer(
    sentence.toCharArray(), 0, sentence.length());
for (String token : tokenizer) {
    System.out.println(token);
}
  3. Execute the program. You will get the following output:
The
blue
goose
quiet
lamb
stopped
smell
roses
.

How it works...

The example started with the declaration of a sample sentence. The program will return a list of words found in the sentence with the stop words removed, as shown in the following code:

String sentence =
    "The blue goose and a quiet lamb stopped to smell the roses.";

An instance of LingPipe's IndoEuropeanTokenizerFactory is used to provide a means of tokenizing the sentence. It is used as the argument to the EnglishStopTokenizerFactory constructor, which provides a stop word tokenizer, as shown in the following code:

TokenizerFactory tokenizerFactory =
    IndoEuropeanTokenizerFactory.INSTANCE;
tokenizerFactory = new EnglishStopTokenizerFactory(tokenizerFactory);

The tokenizer method is invoked against the sentence, where its second and third parameters specify which part of the sentence to tokenize. The Tokenizer class represents the tokens extracted from the sentence:

Tokenizer tokenizer = tokenizerFactory.tokenizer(
    sentence.toCharArray(), 0, sentence.length());

The Tokenizer class implements the Iterable<String> interface that we utilized in the following for-each statement to display the tokens:

for (String token : tokenizer) {
    System.out.println(token);
}

Note in the following output that the first word of the sentence, The, was not removed (the stop list is lowercase, so the capitalized form does not match), nor was the terminating period. Otherwise, common stop words were removed:

The
blue
goose
quiet
lamb
stopped
smell
roses
.
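
One way to also remove capitalized stop words such as The is to lowercase the tokens before the stop-word filter runs. LingPipe provides LowerCaseTokenizerFactory for this. The following sketch is a variation on this recipe, not part of the original text, and assumes an additional import of com.aliasi.tokenizer.LowerCaseTokenizerFactory:

String sentence =
    "The blue goose and a quiet lamb stopped to smell the roses.";
TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
factory = new LowerCaseTokenizerFactory(factory); // normalize case first
factory = new EnglishStopTokenizerFactory(factory); // then remove stop words
Tokenizer tokenizer = factory.tokenizer(
    sentence.toCharArray(), 0, sentence.length());
for (String token : tokenizer) {
    System.out.println(token); // "The" is now lowercased and removed
}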

See also

Key benefits

  • Perform simple-to-complex NLP text processing tasks using modern Java libraries
  • Extract relationships between different text complexities using a problem-solution approach
  • Utilize cloud-based APIs to perform machine translation operations

Description

Natural Language Processing (NLP) has become one of the prime technologies for processing very large amounts of unstructured data from disparate information sources. This book includes a wide set of recipes and quick methods that solve challenges in text syntax, semantics, and speech tasks. At the beginning of the book, you'll learn important NLP techniques, such as identifying parts of speech, tagging words, and analyzing word semantics. You will learn how to perform lexical analysis and use machine learning techniques to speed up NLP operations. With independent recipes, you will explore techniques for customizing your existing NLP engines/models using Java libraries such as OpenNLP and the Stanford NLP library. You will also learn how to use NLP processing features from cloud-based sources, including Google and Amazon Web Services (AWS). You will master core tasks, such as stemming, lemmatization, part-of-speech tagging, and named entity recognition. You will also learn about sentiment analysis, semantic text similarity, language identification, machine translation, and text summarization. By the end of this book, you will be ready to become a professional NLP expert using a problem-solution approach to analyze any sort of text, sentence, or semantic word.

Who is this book for?

This book is for data scientists, NLP engineers, and machine learning developers who want to perform their work on linguistic applications faster with the use of popular libraries on JVM machines. This book will help you build real-world NLP applications using a recipe-based approach. Prior knowledge of Natural Language Processing basics and Java programming is expected.

What you will learn

  • Explore how to use tokenizers in NLP processing
  • Implement NLP techniques in machine learning and deep learning applications
  • Identify sentences within text and learn how to train specialized NER models
  • Learn how to classify documents and perform sentiment analysis
  • Find semantic similarities between text elements and extract text from a variety of sources
  • Preprocess text from a variety of data sources
  • Learn how to identify and translate languages
Table of Contents

13 Chapters
  1. Preparing Text for Analysis and Tokenization
  2. Isolating Sentences within a Document
  3. Performing Named Entity Recognition
  4. Detecting POS Using Neural Networks
  5. Performing Text Classification
  6. Finding Relationships within Text
  7. Language Identification and Translation
  8. Identifying Semantic Similarities within Text
  9. Common Text Processing and Generation Tasks
  10. Extracting Data for Use in NLP Analysis
  11. Creating a Chatbot
  12. Installation and Configuration
  13. Other Books You May Enjoy