You're reading from Natural Language Processing with Java Cookbook Over 70 recipes to create linguistic and language translation applications using Java libraries

Product type Paperback

Published in Apr 2019

Publisher Packt

ISBN-13 9781789801156

Length 386 pages

Edition 1st Edition

Authors (2):

Richard M. Reese

Richard M Reese

Removing stop words using LingPipe

Normalization is the process of preparing text for subsequent analysis. This is frequently performed once the text has been tokenized. Normalization activities include such tasks as converting the text to lowercase, validating data, inserting missing elements, stemming, lemmatization, and removing stop words.

We examined stemming and lemmatization in earlier recipes. In this recipe, we will show how to remove stop words. Stop words are the common words of a language, such as a, the, and and, that carry little meaning on their own. Many downstream NLP tasks do not need them, and removing them from a text can often improve the analysis.
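
To make the idea concrete before turning to LingPipe, here is a minimal plain-Java sketch of stop word removal. It does not use LingPipe: the class name, the method name, and the four-word stop list are assumptions chosen for illustration, and real stop lists are much longer.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordSketch {
    // Illustrative stop list only (an assumption, not LingPipe's list).
    static final Set<String> STOP_WORDS = Set.of("a", "and", "the", "to");

    // Keeps every token that is not in the stop list.
    // The match is exact, so it is case-sensitive.
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("a", "goose", "and", "the", "lamb");
        System.out.println(removeStopWords(tokens)); // prints [goose, lamb]
    }
}
```

The sketch assumes the text has already been tokenized, which matches the order described above: tokenize first, then normalize.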

Getting ready

To prepare, we need to do the following:

Create a new Maven project
Add the following dependency to the POM file:

<dependency>
    <groupId>de.julielab</groupId>
    <artifactId>aliasi-lingpipe</artifactId>
    <version>4.1.0</version>
</dependency>

How to do it...

Let's go through the following steps:

Add the following import statements to your program:

import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.tokenizer.EnglishStopTokenizerFactory;

Add the following code to the main method:

String sentence = 
    "The blue goose and a quiet lamb stopped to smell the roses.";
TokenizerFactory tokenizerFactory = 
    IndoEuropeanTokenizerFactory.INSTANCE;
tokenizerFactory = 
    new EnglishStopTokenizerFactory(tokenizerFactory);
Tokenizer tokenizer = tokenizerFactory.tokenizer(
    sentence.toCharArray(), 0, sentence.length());
for (String token : tokenizer) {
    System.out.println(token);
}

Execute the program. You will get the following output:

The
blue
goose
quiet
lamb
stopped
smell
roses
.

How it works...

The example starts by declaring a sample sentence. The program prints the tokens found in this sentence, with the stop words removed:

String sentence = 
    "The blue goose and a quiet lamb stopped to smell the roses.";

An instance of LingPipe's IndoEuropeanTokenizerFactory provides the basic means of tokenizing the sentence. It is passed as the argument to the EnglishStopTokenizerFactory constructor, which wraps it in a factory whose tokenizers filter out stop words, as shown in the following code:

TokenizerFactory tokenizerFactory =
    IndoEuropeanTokenizerFactory.INSTANCE;
tokenizerFactory = new EnglishStopTokenizerFactory(tokenizerFactory);
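
Passing one factory to the constructor of another is a wrapping (decorator) arrangement: the outer factory delegates tokenization to the inner one and then filters the result. The following plain-Java sketch shows the shape of that idea under simplifying assumptions. SimpleTokenizerFactory, WhitespaceFactory, and StopFilterFactory are hypothetical names invented for this illustration; they are not LingPipe types and do not reflect how LingPipe is implemented internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

class WrappingSketch {
    // Hypothetical stand-in for a tokenizer factory; not a LingPipe type.
    interface SimpleTokenizerFactory {
        List<String> tokenize(String text);
    }

    // Base factory: splits on whitespace only.
    static class WhitespaceFactory implements SimpleTokenizerFactory {
        public List<String> tokenize(String text) {
            String trimmed = text.trim();
            if (trimmed.isEmpty()) {
                return new ArrayList<>();
            }
            return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
        }
    }

    // Wrapper: delegates to another factory, then drops stop words.
    static class StopFilterFactory implements SimpleTokenizerFactory {
        private final SimpleTokenizerFactory base;
        private final Set<String> stopWords;

        StopFilterFactory(SimpleTokenizerFactory base, Set<String> stopWords) {
            this.base = base;
            this.stopWords = stopWords;
        }

        public List<String> tokenize(String text) {
            // Copy so the wrapper never mutates the base factory's list.
            List<String> tokens = new ArrayList<>(base.tokenize(text));
            tokens.removeIf(stopWords::contains);
            return tokens;
        }
    }

    public static void main(String[] args) {
        SimpleTokenizerFactory factory = new WhitespaceFactory();
        // Reassign the variable to the wrapped factory, as the recipe does.
        factory = new StopFilterFactory(factory, Set.of("a", "and", "the", "to"));
        System.out.println(factory.tokenize("blue goose and a quiet lamb"));
        // prints [blue, goose, quiet, lamb]
    }
}
```

Because the wrapper only depends on the interface, the same filter could sit on top of any base tokenizer, which is the benefit of composing factories this way.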

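The tokenizer method is invoked against the sentence's character array. Its second and third parameters specify which part of the array to tokenize: the second is the starting offset and the third is the number of characters to process. Passing 0 and sentence.length() therefore tokenizes the whole sentence. The returned Tokenizer instance represents the tokens extracted from the sentence: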

Tokenizer tokenizer = tokenizerFactory.tokenizer(
    sentence.toCharArray(), 0, sentence.length());
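
The (array, start, length) convention is easy to confuse with a (start, end) pair. As a small illustration, the sketch below uses the standard String(char[], int, int) constructor, which follows the same offset-and-count convention, to show which characters a given start and length select. It assumes that LingPipe's tokenizer method interprets its arguments the same way; the SpanSketch class and span method are hypothetical helpers, not part of LingPipe.

```java
class SpanSketch {
    // Returns the characters in [start, start + length), using the same
    // (array, start, length) convention as new String(char[], int, int).
    static String span(char[] chars, int start, int length) {
        return new String(chars, start, length);
    }

    public static void main(String[] args) {
        String sentence =
            "The blue goose and a quiet lamb stopped to smell the roses.";
        char[] chars = sentence.toCharArray();

        // Whole sentence: start at 0, process sentence.length() characters.
        System.out.println(span(chars, 0, sentence.length()));

        // Sub-span: 10 characters starting at offset 4.
        System.out.println(span(chars, 4, 10)); // prints blue goose
    }
}
```

Under that assumption, passing 4 and 10 to the tokenizer method would restrict tokenization to the words blue goose rather than to the characters between offsets 4 and 10.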

The Tokenizer class implements the Iterable<String> interface, which we use in the following for-each statement to display the tokens:

for (String token : tokenizer) {
    System.out.println(token);
}

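Note that in the output, repeated here, the first word of the sentence, The, was not removed, and neither was the terminating period. LingPipe's English stop list contains only lowercase entries and the match is case-sensitive, so the capitalized The does not match, and the period is not a stop word at all. The other stop words in the sentence (and, a, to, and the lowercase the) were removed, as shown in the following output: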

The
blue
goose
quiet
lamb
stopped
smell
roses
.
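
The capitalized The survives because of case-sensitive matching against lowercase entries. The plain-Java sketch below illustrates the effect and one way around it: fold each token to lowercase before the lookup. The class, the method, and the short stop list are assumptions made for illustration and are not LingPipe code. In LingPipe itself, a comparable result can likely be obtained by wrapping the base factory in a LowerCaseTokenizerFactory before passing it to EnglishStopTokenizerFactory, with the difference that every emitted token would then be lowercase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CaseFoldingSketch {
    // Illustrative lowercase-only stop list (an assumption, not LingPipe's list).
    static final Set<String> STOP_WORDS = Set.of("a", "and", "the", "to");

    // Drops tokens found in the stop list. When foldCase is true, the lookup
    // uses the lowercase form of the token; surviving tokens keep their
    // original casing either way.
    static List<String> filter(List<String> tokens, boolean foldCase) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String key = foldCase ? token.toLowerCase(Locale.ROOT) : token;
            if (!STOP_WORDS.contains(key)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens =
            List.of("The", "blue", "goose", "and", "a", "quiet", "lamb", ".");

        // Case-sensitive lookup: "The" is not equal to "the", so it is kept.
        System.out.println(filter(tokens, false));
        // prints [The, blue, goose, quiet, lamb, .]

        // Lowercase lookup: "The" now matches the entry "the" and is removed.
        System.out.println(filter(tokens, true));
        // prints [blue, goose, quiet, lamb, .]
    }
}
```

The period remains in both cases because punctuation is not in the stop list; removing it would need a separate filtering step.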
