Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Natural Language Processing with Java Cookbook
Natural Language Processing with Java Cookbook

Natural Language Processing with Java Cookbook: Over 70 recipes to create linguistic and language translation applications using Java libraries

Arrow left icon
Profile Icon Richard M. Reese Profile Icon Richard M Reese
Arrow right icon
$9.99 $29.99
eBook Apr 2019 386 pages 1st Edition
eBook
$9.99 $29.99
Paperback
$43.99
Subscription
Free Trial
Renews at $19.99p/m
Arrow left icon
Profile Icon Richard M. Reese Profile Icon Richard M Reese
Arrow right icon
$9.99 $29.99
eBook Apr 2019 386 pages 1st Edition
eBook
$9.99 $29.99
Paperback
$43.99
Subscription
Free Trial
Renews at $19.99p/m
eBook
$9.99 $29.99
Paperback
$43.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Table of content icon View table of contents Preview book icon Preview Book

Natural Language Processing with Java Cookbook

Preparing Text for Analysis and Tokenization

One of the first steps required for Natural Language Processing (NLP) is the extraction of tokens in text. The process of tokenization splits text into tokens—that is, words. Normally, tokens are split based upon delimiters, such as white space. White space includes blanks, tabs, and carriage-return line feeds. However, specialized tokenizers can split tokens according to other delimiters. In this chapter, we will illustrate several tokenizers that you will find useful in your analysis.

Another important NLP task involves determining the stem and lexical meaning of a word. This is useful for deriving more meaning about the words beings processed, as illustrated in the fifth and sixth recipe. The stem of a word refers to the root of a word. For example, the stem of the word antiquated is antiqu. While this may not seem to be the correct stem, the stem of a word is the ultimate base of the word.

The lexical meaning of a word is not concerned with the context in which it is being used. We will be examining the process of performing lemmatization of a word. This is also concerned with finding the root of a word, but uses a more detailed dictionary to find the root. The stem of a word may vary depending on the form the word takes. However, with lemmatization, the root will always be the same. Stemming is often used when we will be satisfied with possibly a less than precise determination of the root of a word. A more thorough discussion of stemming versus lemmatization can be found at: https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/.

The last task in this chapter deals with the process of text normalization. Here, we are concerned with converting the token that is extracted to a form that can be more easily processed during later analysis. Typical normalization activities include converting cases, expanding abbreviations, removing stop words along with stemming, and lemmatization. Stop words are those words that can often be ignored with certain types of analyses. For example, in some contexts, the word the does not always need to be included.

In this chapter, we will cover the following recipes:

  • Tokenization using the Java SDK
  • Tokenization using OpenNLP
  • Tokenization using maximum entropy
  • Training a neural network tokenizer for specialized text
  • Identifying the stem of a word
  • Training an OpenNLP lemmatization model
  • Determining the lexical meaning of a word using OpenNLP
  • Removing stop words using LingPipe

Technical requirements

Tokenization using the Java SDK

Tokenization can be achieved using a number of Java classes, including the String, StringTokenizer, and StreamTokenizer classes. In this recipe, we will demonstrate the use of the Scanner class. While frequently used for console input, it can also be used to tokenize a string.

Getting ready

To prepare, we need to create a new Java project.

How to do it...

Let's go through the following steps:

  1. Add the following import statement to your project's class:
import java.util.ArrayList;
import java.util.Scanner;
  1. Add the following statements to the main method to declare the sample string, create an instance of the Scanner class, and add a list to hold the tokens:
String sampleText = 
"In addition, the rook was moved too far to be effective.";
Scanner scanner = new Scanner(sampleText);
ArrayList<String> list = new ArrayList<>();
  1. Insert the following loops to populate the list and display the tokens:
while (scanner.hasNext()) {
String token = scanner.next();
list.add(token);
}

for (String token : list) {
System.out.println(token);
}
  1. Execute the program. You should get the following output:
In
addition,
the
rook
was
moved
too
far
to
be
effective.

How it works...

The Scanner class's constructor took a string as an argument. This allowed us to apply the Scanner class's methods against the text we used in the next method, which returns a single token at a time, delimited by white spaces. While it was not necessary to store the tokens in a list, this permits us to use it later for different purposes.

Tokenization using OpenNLP

In this recipe, we will create an instance of the OpenNLP SimpleTokenizer class to illustrate tokenization. We will use its tokenize method against a sample text.

Getting ready

To prepare, we need to do the following:

  1. Create a new Java project
  2. Add the following POM dependency to your project:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.9.0</version>
</dependency>

How to do it...

Let's go through the following steps:

  1. Start by adding the following import statement to your project's class:
import opennlp.tools.tokenize.SimpleTokenizer;
  1. Next, add the following main method to your project:
public static void main(String[] args) {
String sampleText =
"In addition, the rook was moved too far to be effective.";
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
String tokenList[] = simpleTokenizer.tokenize(sampleText);
for (String token : tokenList) {
System.out.println(token);
}
}

After executing the program, you should get the following output:

In
addition
,
the
rook
was
moved
too
far
to
be
effective
.

How it works...

The SimpleTokenizer instance represents a tokenizer that will split text using white space delimiters, which are accessed through the class's INSTANCE field. With this tokenizer, we use its tokenize method to pass a single string returning an array of strings, as shown in the following code:

  String sampleText = 
"In addition, the rook was moved too far to be effective.";
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
String tokenList[] = simpleTokenizer.tokenize(sampleText);

We then iterated through the list of tokens and displayed one per line. Note how the tokenizer treats the comma and the period as tokens.

See also

Tokenization using maximum entropy

Maximum entropy is a statistical classification technique. It takes various characteristics of a subject, such as the use of specialized words or the presence of whiskers in a picture, and assigns a weight to each characteristic. These weights are eventually added up and normalized to a value between 0 and 1, indicating the probability that the subject is of a particular kind. With a high enough level of confidence, we can conclude that the text is all about high-energy physics or that we have a picture of a cat.

If you're interested, you can find a more complete explanation of this technique at https://nadesnotes.wordpress.com/2016/09/05/natural-language-processing-nlp-fundamentals-maximum-entropy-maxent/. In this recipe, we will demonstrate the use of maximum entropy with the OpenNLP TokenizerME class.

Getting ready

To prepare, we need to do the following:

  1. Create a new Maven project.
  2. Download the en-token.bin file from http://opennlp.sourceforge.net/models-1.5/. Save it at the root directory of the project.
  3. Add the following POM dependency to your project:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.9.0</version>
</dependency>

How to do it...

Let's go through the following steps:

  1. Add the following imports to the project:
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
  1. Next, add the following code to the main method. This sequence initializes the text to be processed and creates an input stream to read in the tokenization model. Modify the first argument of the File constructor to reflect the path to the model files:
String sampleText = 
"In addition, the rook was moved too far to be effective.";
try (InputStream modelInputStream = new FileInputStream(
new File("...", "en-token.bin"))) {
...
} catch (FileNotFoundException e) {
// Handle exception
} catch (IOException e) {
// Handle exception
}
  1. Add the following code to the try block. It creates a tokenizer model and then the actual tokenizer:
TokenizerModel tokenizerModel = 
new TokenizerModel(modelInputStream);
Tokenizer tokenizer = new TokenizerME(tokenizerModel);
  1. Insert the following code sequence that uses the tokenize method to create a list of tokens and then display the tokens:
String tokenList[] = tokenizer.tokenize(sampleText);
for (String token : tokenList) {
System.out.println(token);
}
  1. Next, execute the program. You should get the following output:
In
addition
,
the
rook
was
moved
too
far
to
be
effective
.

How it works...

The sampleText variable holds the test string. A try-with-resources block is used to automatically close the InputStream. The new File statement throws a FileNotFoundException, while the new TokenizerModel(modelInputStream) statement throws an IOException, both of which need to be handled.

The code examples in this book that deal with exception handling include a comment suggesting that exceptions should be handled. The user is encouraged to add the appropriate code to deal with exceptions. This will often include print statements or possibly logging operations.

An instance of the TokenizerModel class is created using the en-token.bin model. This model has been trained to recognize English text. An instance of the TokenizerME class represents the tokenizer where the tokenize method is executed against it using the sample text. This method returns an array of strings that are then displayed. Note that the comma and period are treated as separate tokens.

See also

Training a neural network tokenizer for specialized text

Sometimes, we need to work with specialized text, such as an uncommon language or text that is unique to a problem domain. In such cases, the standard tokenizers are not always sufficient. This necessitates the creation of a unique model that will work better with the specialized text. In this recipe, we will demonstrate how to train a model using OpenNLP.

Getting ready

To prepare, we need to do the following:

  1. Create a new Maven project
  2. Add the following dependency to the POM file:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.9.0</version>
</dependency>

How to do it...

Let's go through the following steps:

  1. Create a file called training-data.train. Add the following to the file:
The first sentence is terminated by a period<SPLIT>. We will want to be able to identify tokens that are separated by something other than whitespace<SPLIT>. This can include commas<SPLIT>, numbers such as 100.204<SPLIT>, and other punctuation characters including colons:<SPLIT>.
  1. Next, add the following imports to the program:
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
  1. Next, add the following try block to the project's main method that contains the code needed to obtain the training data:
InputStreamFactory inputStreamFactory = new InputStreamFactory() {
public InputStream createInputStream()
throws FileNotFoundException {
return new FileInputStream(
"C:/NLP Cookbook/Code/chapter2a/training-data.train");
}
};
  1. Insert the following code segment into the try block that will train the model and save it:
try (
ObjectStream<String> stringObjectStream =
new PlainTextByLineStream(inputStreamFactory, "UTF-8");
ObjectStream<TokenSample> tokenSampleStream =
new TokenSampleStream(stringObjectStream);) {

TokenizerModel tokenizerModel = TokenizerME.train(
tokenSampleStream, new TokenizerFactory(
"en", null, true, null),
TrainingParameters.defaultParams());
BufferedOutputStream modelOutputStream =
new BufferedOutputStream(new FileOutputStream(
new File(
"C:/NLP Cookbook/Code/chapter2a/mymodel.bin")));
tokenizerModel.serialize(modelOutputStream);
} catch (IOException ex) {
// Handle exception
}
  1. To test the new model, we will reuse the code found in the Tokenization using OpenNLP recipe. Add the following code after the preceding try block:
String sampleText = "In addition, the rook was moved too far to be effective.";
try (InputStream modelInputStream = new FileInputStream(
new File("C:/Downloads/OpenNLP/Models", "mymodel.bin"));) {
TokenizerModel tokenizerModel =
new TokenizerModel(modelInputStream);
Tokenizer tokenizer = new TokenizerME(tokenizerModel);
String tokenList[] = tokenizer.tokenize(sampleText);
for (String token : tokenList) {
System.out.println(token);
}
} catch (FileNotFoundException e) {
// Handle exception
} catch (IOException e) {
// Handle exception
}
  1. When executing the program, you will get an output similar to the following. Some of the training model output has been removed to save space:
Indexing events with TwoPass using cutoff of 5

Computing event counts... done. 36 events
Indexing... done.
Sorting and merging events... done. Reduced 36 events to 12.
Done indexing in 0.21 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 12
Number of Outcomes: 2
Number of Predicates: 9
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-24.95329850015802 0.8611111111111112
2: ... loglikelihood=-14.200654164477221 0.8611111111111112
3: ... loglikelihood=-11.526745527757855 0.8611111111111112
4: ... loglikelihood=-9.984657035211438 0.8888888888888888
...
97: ... loglikelihood=-0.7805227945549726 1.0
98: ... loglikelihood=-0.7730211829010772 1.0
99: ... loglikelihood=-0.765664507836384 1.0
100: ... loglikelihood=-0.7584485899716518 1.0
In
addition
,
the
rook
was
moved
too
far
to
be
effective
.

How it works...

To understand how this all works, we will explain the training code, the testing code, and the output. We will start with the training code.

To create a model, we need test data that was saved in the training-data.train file. Its contents are as follows:

These fields are used to provide further information about how tokens should be identified<SPLIT>.  They can help identify breaks between numbers<SPLIT>, such as 23.6<SPLIT>, punctuation characters such as commas<SPLIT>.

The <SPLIT> markup has been added just before the places where the tokenizer should split code, in locations rather than white spaces. Normally, we would use a larger set of data to obtain a better model. For our purposes, this file will work.

We created an instance of the InputStreamFactory to represent the training data file, as shown in the following code:

InputStreamFactory inputStreamFactory = new InputStreamFactory() {
public InputStream createInputStream()
throws FileNotFoundException {
return new FileInputStream("training-data.train");
}
};

An object stream is created in the try block that read from the file. The PlainTextByLineStream class processes plain text line by line. This stream was then used to create another input stream of TokenSample objects, providing a usable form for training the model, as shown in the following code:

try (
ObjectStream<String> stringObjectStream =
new PlainTextByLineStream(inputStreamFactory, "UTF-8");
ObjectStream<TokenSample> tokenSampleStream =
new TokenSampleStream(stringObjectStream);) {
...
} catch (IOException ex) {
// Handle exception
}

The train method performed the training. It takes the token stream, a TokenizerFactory instance, and a set of training parameters. The TokenizerFactory instance provides the basic tokenizer. Its arguments include the language used and other factors, such as an abbreviation dictionary. In this example, English is the language, and the other arguments are not used. We used the default set of training parameters, as shown in the following code:

TokenizerModel tokenizerModel = TokenizerME.train(
tokenSampleStream, new TokenizerFactory("en", null, true, null),
TrainingParameters.defaultParams());

Once the model was trained, we saved it to the mymodel.bin file using the serialize method:

BufferedOutputStream modelOutputStream = new BufferedOutputStream(
new FileOutputStream(new File("mymodel.bin")));
tokenizerModel.serialize(modelOutputStream);

To test the model, we reused the tokenization code found in the Tokenization using the OpenNLP recipe. You can refer to that recipe for an explanation of the code.

The output of the preceding code displays various statistics, such as the number of passes and iterations performed. One token was displayed per line, as shown in the following code. Note that the comma and period are treated as separate tokens using this model:

In
addition
,
the
rook
was
moved
too
far
to
be
effective
.

There's more...

See also

Identifying the stem of a word

Finding the stem of a word is easy to do. We will illustrate this process using OpenNLP’s PorterStemmer class.

Getting ready

To prepare, we need to do the following:

  1. Create a new Maven project
  2. Add the following dependency to the POM file:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.9.0</version>
</dependency>

How to do it...

Let's go through the following steps:

  1. Add the following import statement to the program:
import opennlp.tools.stemmer.PorterStemmer;
  1. Then, add the following code to the main method:
String wordList[] = 
{ "draft", "drafted", "drafting", "drafts",
"drafty", "draftsman" };
PorterStemmer porterStemmer = new PorterStemmer();
for (String word : wordList) {
String stem = porterStemmer.stem(word);
System.out.println("The stem of " + word + " is " + stem);
}
  1. Execute the program. The output should be as follows:
The stem of drafted is draft
The stem of drafting is draft
The stem of drafts is draft
The stem of drafty is drafti
The stem of draftsman is draftsman

How it works...

We start by creating an array of strings that will hold words that we will use with the stemmer:

String wordList[] = 
{ "draft", "drafted", "drafting", "drafts", "drafty", "draftsman" };

The OpenNLP PorterStemmer class supports finding the stem of a word. It has a single default constructor that is used to create an instance of the class, as shown in the following code. This is the only constructor available for this class:

PorterStemmer porterStemmer = new PorterStemmer();

The remainder of the code iterates over the array and invokes the stem method against each word in the array, as shown in the following code:

for (String word : wordList) {
String stem = porterStemmer.stem(word);
System.out.println("The stem of " + word + " is " + stem);
}

See also

Training an OpenNLP lemmatization model

We will train a model using OpenNLP, which can be used to perform lemmatization. The actual process of performing lemmatization is illustrated in the following recipe, Determining the lexical meaning of a word using OpenNLP.

Getting ready

The most straightforward technique to train a model is to use the OpenNLP command-line tools. Download these tools from the OpenNLP page at https://opennlp.apache.org/download.html. We will not need the source code for these tools, so download the file named apache-opennlp-1.9.0-bin.tar.gz. Selecting that file will take you to a page that lists mirror sites for the file. Choose one that will work best for your location.

Once the file has been saved, expand the file. This will extract a .tar file. Next, expand this file, which will create a directory called apache-opennlp-1.9.0. In its bin subdirectory, you will find the tools that we need.

We will need training data for the training process. We will use the en-lemmatizer.dict file found at https://raw.githubusercontent.com/richardwilly98/elasticsearch-opennlp-auto-tagging/master/src/main/resources/models/en-lemmatizer.dict. Use a browser to open this page and then save this page using the file name en-lemmatizer.dict.

How to do it...

Let's go through the following steps:

  1. Open a command-line window. We used the Window's cmd program in this example
  2. Set up a path for the OpenNLP tool's bin directory and then navigate to the directory containing the en-lemmatizer.dict file.
  3. Execute the following command:
opennlp LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data en-lemmatizer.dict -encoding UTF-8

You will get the following output. It has been shortened here to save space:

Indexing events with TwoPass using cutoff of 5
Computing event counts... done. 301403 events Indexing... done.

Sorting and merging events... done. Reduced 301403 events to 297777.
Done indexing in 9.09 s.

Incorporating indexed data for training...
done.
Number of Event Tokens: 297777
Number of Outcomes: 432
Number of Predicates: 69122
...done.

Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-1829041.6775780176 3.317817009120679E-6
2: ... loglikelihood=-452333.43760414346 0.876829361353404
3: ... loglikelihood=-211099.05280473927 0.9506806501594212
4: ... loglikelihood=-132195.3981804198 0.9667554735686108
...
98: ... loglikelihood=-6702.5821153954375 0.9988420818638168
99: ... loglikelihood=-6652.6134177562335 0.998845399680826
100: ... loglikelihood=-6603.518040975329 0.9988553531318534

Writing lemmatizer model
... done (1.274s)
Wrote lemmatizer model to
path: C:\Downloads\OpenNLP\en-lemmatizer.bin

Execution time: 275.369 seconds

How it works...

To understand the output, we need to explain the following command:

opennlp LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data en-lemmatizer.dict -encoding UTF-8

The opennlp command is used with a number of OpenNLP tools. The tool to be used is specified by the command's first argument. In this example, we used the LemmatizerTrainerME tool. The arguments that follow control how the training process works. The LemmatizerTrainerME arguments are documented at https://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.cli.lemmatizer.LemmatizerTrainerME.

We use the -model, -lang, -data, and -encoding arguments, as detailed in the following list:

  • The -model argument specifies the name of the model output file. This is the file that holds the trained model that we will use in the next recipe.
  • The -lang argument specifies the natural language used. In this case, we use en, which indicates the training data is English.
  • The -data argument specifies the file containing the training data. We used the en-lemmatizer.dict file.
  • The -encoding parameter specifies the character set used by the training data. We used UTF-8, which indicates the data is Unicode data.

The output shows the training process. It displays various statistics, such as the number of passes and iterations performed. During each iteration, the probability increases, as shown in the following code. With the 100th iteration, the probability approaches 100.

Performing 100 iterations:

1: ... loglikelihood=-1829041.6775780176 3.317817009120679E-6
2: ... loglikelihood=-452333.43760414346 0.876829361353404
3: ... loglikelihood=-211099.05280473927 0.9506806501594212
4: ... loglikelihood=-132195.3981804198 0.9667554735686108
...
98: ... loglikelihood=-6702.5821153954375 0.9988420818638168
99: ... loglikelihood=-6652.6134177562335 0.998845399680826
100: ... loglikelihood=-6603.518040975329 0.9988553531318534
Writing lemmatizer model ... done (1.274s)

The final part of the output shows where the file is written. We wrote the lemmatizer model to the path :\Downloads\OpenNLP\en-lemmatizer.bin.

There's more...

If you have specialized lemmatization needs, then you will need to create a training file. The training data file consists of a series of lines. Each line consists of three entries separated by spaces. The first entry contains a word. The second entry is the POS tag for the word. The third entry is the lemma for the word.

For example, in en-lemmatizer.dict, there are several lines for variations of the word bump, as shown in the following code:

bump    NN         bump
bump VB bump
bump VBP bump
bumped VBD bump
bumped VBN bump
bumper JJ bumper
bumper NN bumper

As you can see, a word may be used in different contexts and with different suffixes. Other datasets can be used for training. These include the Penn Treebank (https://web.archive.org/web/19970614160127/http://www.cis.upenn.edu/~treebank/) and the CoNLL 2009 datasets (https://www.ldc.upenn.edu/).

Training parameters other than the default parameters can be specified depending on the needs of the problem.

In the next recipe, Determining the lexical meaning of a word using OpenNLP, we will use the model to develop and determine the lexical meaning of a word.

See also

Determining the lexical meaning of a word using OpenNLP

In this recipe, we will use the model we created in the previous recipe to perform lemmatization. We will perform lemmatization on the following sentence:

The girls were leaving the clubhouse for another adventurous afternoon.

In the example, the lemmas for each word in the sentence will be displayed.

Getting ready

To prepare, we need to do the following:

  1. Create a new Maven project
  2. Add the following dependency to the POM file:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.9.0</version>
</dependency>

How to do it...

Let's go through the following steps:

  1. Add the following imports to the project:
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
  1. Add the following try block to the main method. An input stream and model are created, followed by the instantiation of the lemmatization model:
LemmatizerModel lemmatizerModel = null;
try (InputStream modelInputStream = new FileInputStream(
"C:\\Downloads\\OpenNLP\\en-lemmatizer.bin")) {
lemmatizerModel = new LemmatizerModel(modelInputStream);
LemmatizerME lemmatizer = new LemmatizerME(lemmatizerModel);

} catch (FileNotFoundException e) {
// Handle exception
} catch (IOException e) {
// Handle exception
}

  1. Add the following code to the end of the try block. It sets up arrays holding the words of the sample text and their POS tags. It then performs the lemmatization and displays the results:
String[] tokens = new String[] { 
"The", "girls", "were", "leaving", "the",
"clubhouse", "for", "another", "adventurous",
"afternoon", "." };
String[] posTags = new String[] { "DT", "NNS", "VBD",
"VBG", "DT", "NN", "IN", "DT", "JJ", "NN", "." };
String[] lemmas = lemmatizer.lemmatize(tokens, posTags);
for (int i = 0; i < tokens.length; i++) {
System.out.println(tokens[i] + " - " + lemmas[i]);
}
  1. Upon executing the program, you will get the following output that displays each word and then its lemma:
The - the
girls - girl
were - be
leaving - leave
the - the
clubhouse - clubhouse
for - for
another - another
adventurous - adventurous
afternoon - afternoon
. - .

How it works...

We performed lemmatization on the sentence The girls were leaving the clubhouse for another adventurous afternoon. A LemmatizerModel was declared and instantiated from the en-lemmatizer.bin file. A try-with-resources block was used to obtained an input stream for the file, as shown in the following code:

LemmatizerModel lemmatizerModel = null;
try (InputStream modelInputStream = new FileInputStream(
"C:\\Downloads\\OpenNLP\\en-lemmatizer.bin")) {
lemmatizerModel = new LemmatizerModel(modelInputStream);

Next, the lemmatizer was created using the LemmatizerME class, as shown in the following code:

LemmatizerME lemmatizer = new LemmatizerME(lemmatizerModel);

The following sentence was processed, and is represented as an array of strings. We also need an array of POS tags for the lemmatization process to work. This array was defined in parallel with the sentence array. As we will see in Chapter 4, Detecting POS Using Neural Networks, there are often alternative tags that are possible for a sentence. For this example, we used tags generated by the Cognitive Computation Group's online tool at http://cogcomp.org/page/demo_view/pos:

String[] tokens = new String[] { 
"The", "girls", "were", "leaving", "the",
"clubhouse", "for", "another", "adventurous",
"afternoon", "." };
String[] posTags = new String[] { "DT", "NNS", "VBD",
"VBG", "DT", "NN", "IN", "DT", "JJ", "NN", "." };

The lemmatization then occurred, where the lemmatize method uses the two arrays to build an array of lemmas for each word in the sentence, as shown in the following code:

String[] lemmas = lemmatizer.lemmatize(tokens, posTags);

The lemmas are then displayed, as shown in the following code:

for (int i = 0; i < tokens.length; i++) {
System.out.println(tokens[i] + " - " + lemmas[i]);
}

See also

Removing stop words using LingPipe

Normalization is the process of preparing text for subsequent analysis. This is frequently performed once the text has been tokenized. Normalization activities include such tasks as converting the text to lowercase, validating data, inserting missing elements, stemming, lemmatization, and removing stop words.

We have already examined the stemming and lemmatization process in earlier recipes. In this recipe, we will show how stop words can be removed. Stop words are those words that are not always useful. For example, some downstream NLP tasks do not need to have words such as a, the, or and. These types of words are the common words found in a language. Analysis can often be enhanced by removing them from a text.

Getting ready

To prepare, we need to do the following:

  1. Create a new Maven project
  2. Add the following dependency to the POM file:
<dependency>
<groupId>de.julielab</groupId>
<artifactId>aliasi-lingpipe</artifactId>
<version>4.1.0</version>
</dependency>

How to do it...

Let's go through the following steps:

  1. Add the following import statements to your program:
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.tokenizer.EnglishStopTokenizerFactory;
  1. Add the following code to the main method:
String sentence = 
"The blue goose and a quiet lamb stopped to smell the roses.";
TokenizerFactory tokenizerFactory =
IndoEuropeanTokenizerFactory.INSTANCE;
tokenizerFactory =
new EnglishStopTokenizerFactory(tokenizerFactory);
Tokenizer tokenizer =tokenizerFactory.tokenizer(
sentence.toCharArray(), 0, sentence.length());
for (String token : tokenizer) {
System.out.println(token);
}
  1. Execute the program. You will get the following output:
The
blue
goose
quiet
lamb
stopped
smell
roses
.

How it works...

The example started with the declaration of a sample sentence. The program will return a list of words found in the sentence with the stop words removed, as shown in the following code:

String sentence = 
"The blue goose and a quiet lamb stopped to smell the roses.";

An instance of LingPipe's IndoEuropeanTokenizerFactory is used to provide a means of tokenizing the sentence. It is used as the argument to the EnglishStopTokenizerFactory constructor, which provides a stop word tokenizer, as shown in the following code:

TokenizerFactory tokenizerFactory =
IndoEuropeanTokenizerFactory.INSTANCE;
tokenizerFactory = new EnglishStopTokenizerFactory(tokenizerFactory);

The tokenizer method is invoked against the sentence, where its second and third parameters specify which part of the sentence to tokenize. The Tokenizer class represents the tokens extracted from the sentence:

Tokenizer tokenizer = tokenizerFactory.tokenizer(
sentence.toCharArray(), 0, sentence.length());

The Tokenizer class implements the Iterable<String> interface that we utilized in the following for-each statement to display the tokens:

for (String token : tokenizer) {
System.out.println(token);
}

Note that in the following duplicated output, the first word of the sentence, The, was not removed, nor was there a terminating period. Otherwise, common stop words were removed, as shown in the following code:

The
blue
goose
quiet
lamb
stopped
smell
roses
.

See also

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Perform simple-to-complex NLP text processing tasks using modern Java libraries
  • Extract relationships between different text complexities using a problem-solution approach
  • Utilize cloud-based APIs to perform machine translation operations

Description

Natural Language Processing (NLP) has become one of the prime technologies for processing very large amounts of unstructured data from disparate information sources. This book includes a wide set of recipes and quick methods that solve challenges in text syntax, semantics, and speech tasks. At the beginning of the book, you'll learn important NLP techniques, such as identifying parts of speech, tagging words, and analyzing word semantics. You will learn how to perform lexical analysis and use machine learning techniques to speed up NLP operations. With independent recipes, you will explore techniques for customizing your existing NLP engines/models using Java libraries such as OpenNLP and the Stanford NLP library. You will also learn how to use NLP processing features from cloud-based sources, including Google and Amazon Web Services (AWS). You will master core tasks, such as stemming, lemmatization, part-of-speech tagging, and named entity recognition. You will also learn about sentiment analysis, semantic text similarity, language identification, machine translation, and text summarization. By the end of this book, you will be ready to become a professional NLP expert using a problem-solution approach to analyze any sort of text, sentence, or semantic word.

Who is this book for?

This book is for data scientists, NLP engineers, and machine learning developers who want to perform their work on linguistic applications faster with the use of popular libraries on JVM machines. This book will help you build real-world NLP applications using a recipe-based approach. Prior knowledge of Natural Language Processing basics and Java programming is expected.

What you will learn

  • Explore how to use tokenizers in NLP processing
  • Implement NLP techniques in machine learning and deep learning applications
  • Identify sentences within text and learn how to train specialized NER models
  • Learn how to classify documents and perform sentiment analysis
  • Find semantic similarities between text elements and extract text from a variety of sources
  • Preprocess text from a variety of data sources
  • Learn how to identify and translate languages

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Apr 25, 2019
Length: 386 pages
Edition : 1st
Language : English
ISBN-13 : 9781789808834
Vendor :
Oracle
Category :
Languages :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Product Details

Publication date : Apr 25, 2019
Length: 386 pages
Edition : 1st
Language : English
ISBN-13 : 9781789808834
Vendor :
Oracle
Category :
Languages :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 126.97
Natural Language Processing with Java Cookbook
$43.99
Hands-On Design Patterns with Java
$43.99
Learn Java 12 Programming
$38.99
Total $ 126.97 Stars icon
Banner background image

Table of Contents

13 Chapters
Preparing Text for Analysis and Tokenization Chevron down icon Chevron up icon
Isolating Sentences within a Document Chevron down icon Chevron up icon
Performing Name Entity Recognition Chevron down icon Chevron up icon
Detecting POS Using Neural Networks Chevron down icon Chevron up icon
Performing Text Classification Chevron down icon Chevron up icon
Finding Relationships within Text Chevron down icon Chevron up icon
Language Identification and Translation Chevron down icon Chevron up icon
Identifying Semantic Similarities within Text Chevron down icon Chevron up icon
Common Text Processing and Generation Tasks Chevron down icon Chevron up icon
Extracting Data for Use in NLP Analysis Chevron down icon Chevron up icon
Creating a Chatbot Chevron down icon Chevron up icon
Installation and Configuration Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.