Survey of NLP tools

There are many tools available that support NLP. Some of them ship with the Java SE SDK but are limited in their utility for all but the simplest problems. Other libraries, such as Apache OpenNLP and LingPipe, provide extensive and sophisticated support for NLP problems.

Low-level Java support includes string classes such as String, StringBuilder, and StringBuffer. These classes possess methods that perform searching, matching, and text replacement. Regular expressions use a special encoding to match substrings, and Java provides a rich set of classes for working with them.
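
As a brief illustration, the java.util.regex classes and the regex-aware String methods can be combined for matching and replacement. The following is a minimal sketch; the RegexDemo class name and the sample sentence are invented purely for illustration:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        String text = "He lives at 1511 W. Randolph.";

        // Find every run of digits in the text.
        Matcher matcher = Pattern.compile("\\d+").matcher(text);
        while (matcher.find()) {
            System.out.println("Found: " + matcher.group());   // Found: 1511
        }

        // Several String methods accept regular expressions directly.
        System.out.println(text.replaceAll("\\d+", "<number>"));
    }
}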

As discussed earlier, tokenizers are used to split text into individual elements. Java provides support for tokenization with the following classes, each illustrated in the short sketch after this list:

  • The String class' split method
  • The StreamTokenizer class
  • The StringTokenizer class
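
The following sketch applies each of these to the same string; the sample sentence and the CoreTokenizers class name are invented purely for illustration. Note that the whitespace-based approaches leave punctuation attached to the preceding word:

import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.StringTokenizer;

public class CoreTokenizers {
    public static void main(String[] args) throws IOException {
        String text = "Let us split this sample text.";

        // String.split: split on one or more whitespace characters.
        for (String token : text.split("\\s+")) {
            System.out.print("[" + token + "] ");
        }
        System.out.println();

        // StringTokenizer: whitespace delimiters by default.
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            System.out.print("[" + st.nextToken() + "] ");
        }
        System.out.println();

        // StreamTokenizer: reads from a Reader and classifies each token;
        // only word tokens are printed here.
        StreamTokenizer streamTokenizer = new StreamTokenizer(new StringReader(text));
        while (streamTokenizer.nextToken() != StreamTokenizer.TT_EOF) {
            if (streamTokenizer.ttype == StreamTokenizer.TT_WORD) {
                System.out.print("[" + streamTokenizer.sval + "] ");
            }
        }
        System.out.println();
    }
}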

There are also a number of NLP libraries/APIs for Java. A partial list of Java-based NLP APIs is found in the following table. Most of these are open source. In addition, there are a number of commercial APIs available. We will focus on the open source APIs:

Many of these NLP tasks are combined to form a pipeline. A pipeline consists of various NLP tasks, which are integrated into a series of steps to achieve some processing goal. Examples of frameworks that support pipelines are GATE and Apache UIMA.
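
GATE and UIMA are full pipeline frameworks; the underlying idea, however, can be sketched with nothing more than composed functions. The following toy illustration is not part of either framework and simply chains a tokenization stage with a normalization stage:

import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PipelineSketch {
    public static void main(String[] args) {
        // Stage 1: tokenization (split on whitespace).
        Function<String, List<String>> tokenize =
            text -> Arrays.asList(text.split("\\s+"));

        // Stage 2: normalization (lowercase each token).
        Function<List<String>, List<String>> lowercase =
            tokens -> tokens.stream()
                            .map(String::toLowerCase)
                            .collect(Collectors.toList());

        // Compose the stages into a single pipeline and run it.
        Function<String, List<String>> pipeline = tokenize.andThen(lowercase);
        System.out.println(pipeline.apply("The Pipeline Idea In Miniature"));
    }
}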

In the next section, we will cover several NLP APIs in more depth. A brief overview of their capabilities will be presented along with a list of useful links for each API.

Apache OpenNLP

The Apache OpenNLP project addresses common NLP tasks and will be used throughout this book. It consists of several components that perform specific tasks, permit models to be trained, and support testing of those models. The general approach used by OpenNLP is to instantiate, from a file, a model that supports the task and then execute methods against that model to perform the task.

For example, in the following sequence, we will tokenize a simple string. For this code to execute properly, it must handle the FileNotFoundException and IOException exceptions. We use a try-with-resources block to open a FileInputStream instance using the en-token.bin file. This file contains a model that has been trained using English text:

try (InputStream is = new FileInputStream(
        new File(getModelDir(), "en-token.bin"))){
    // Insert code to tokenize the text
} catch (FileNotFoundException ex) {
    …
} catch (IOException ex) {
    …
}

An instance of the TokenizerModel class is then created using this file inside the try block. Next, we create an instance of the Tokenizer class, as shown here:

TokenizerModel model = new TokenizerModel(is);
Tokenizer tokenizer = new TokenizerME(model);

The tokenize method is then applied; its argument is the text to be tokenized. The method returns an array of String objects:

String tokens[] = tokenizer.tokenize("He lives at 1511 W. " + "Randolph.");

A for-each statement displays the tokens as shown here. The open and close brackets are used to clearly identify the tokens:

for (String a : tokens) {
  System.out.print("[" + a + "] ");
}
System.out.println();

When we execute this, we will get output as shown here:

[He] [lives] [at] [1511] [W.] [Randolph] [.]

In this case, the tokenizer recognized that W. was an abbreviation and that the last period was a separate token marking the end of the sentence.
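
Putting the pieces together, a self-contained version of the example might look like the following. The getModelDir helper and the models directory are assumptions here; adjust the path to wherever the OpenNLP model files are stored on your system:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class OpenNLPTokenizerExample {
    public static void main(String[] args) {
        try (InputStream is = new FileInputStream(
                new File(getModelDir(), "en-token.bin"))) {
            // Load the trained English tokenizer model and tokenize the text.
            TokenizerModel model = new TokenizerModel(is);
            Tokenizer tokenizer = new TokenizerME(model);
            String[] tokens = tokenizer.tokenize("He lives at 1511 W. Randolph.");
            for (String token : tokens) {
                System.out.print("[" + token + "] ");
            }
            System.out.println();
        } catch (IOException ex) {
            // FileNotFoundException is a subclass of IOException, so one catch suffices here.
            ex.printStackTrace();
        }
    }

    private static File getModelDir() {
        // Assumed location of the OpenNLP model files; adjust as needed.
        return new File("models");
    }
}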

We will use the OpenNLP API for many of the examples in this book. OpenNLP links are listed in the following table:

Stanford NLP

The Stanford NLP Group conducts NLP research and provides tools for NLP tasks. Stanford CoreNLP is one of these toolsets. In addition, there are other toolsets such as the Stanford Parser, the Stanford POS Tagger, and the Stanford Classifier. The Stanford tools support English and Chinese and basic NLP tasks, including tokenization and named entity recognition.

These tools are released under the full GPL, which does not allow them to be used in proprietary commercial applications, though a commercial license is available. The API is well organized and supports the core NLP functionality.

There are several tokenization approaches supported by the Stanford group. We will use the PTBTokenizer class to illustrate the use of this NLP library. The constructor demonstrated here uses a Reader object, a LexedTokenFactory<T> argument, and a string to specify which of the several options is to be used.

The LexedTokenFactory is an interface that is implemented by the CoreLabelTokenFactory and WordTokenFactory classes. The former class supports the retention of the beginning and ending character positions of a token, whereas the latter class simply returns a token as a string without any positional information. The WordTokenFactory class is used by default.

The CoreLabelTokenFactory class is used in the following example. A StringReader is created using a string. The last argument is used for the option parameter, which is null for this example. The Iterator interface is implemented by the PTBTokenizer class allowing us to use the hasNext and next methods to display the tokens:

PTBTokenizer ptb = new PTBTokenizer(
        new StringReader("He lives at 1511 W. Randolph."),
        new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
  System.out.println(ptb.next());
}

The output is as follows:

He
lives
at
1511
W.
Randolph
.
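
Because a CoreLabelTokenFactory was supplied, each returned token is a CoreLabel that retains its character offsets in the original string. The following minimal sketch reuses the same sentence and shows how the offsets can be retrieved using the standard CoreLabel accessors:

import java.io.StringReader;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class PTBOffsetsExample {
    public static void main(String[] args) {
        PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
                new StringReader("He lives at 1511 W. Randolph."),
                new CoreLabelTokenFactory(), null);
        while (ptb.hasNext()) {
            CoreLabel label = ptb.next();
            // Print each token with its begin and end character positions.
            System.out.println(label.word() + " ["
                    + label.beginPosition() + ", " + label.endPosition() + ")");
        }
    }
}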

We will use the Stanford NLP library extensively in this book. A list of Stanford links is found in the following table. Documentation and download links are found in each of the distributions:

LingPipe

LingPipe consists of a set of tools for performing common NLP tasks. It supports model training and testing. There are both royalty-free and licensed versions of the tool, and production use of the free version is restricted.

To demonstrate the use of LingPipe, we will illustrate how it can be used to tokenize text using the Tokenizer class. Start by declaring two lists, one to hold the tokens and a second to hold the whitespace:

List<String> tokenList = new ArrayList<>();
List<String> whiteList = new ArrayList<>();

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Next, declare a string to hold the text to be tokenized:

String text = "A sample sentence processed \nby \tthe " +
    "LingPipe tokenizer.";

Now, create an instance of the Tokenizer class. As shown in the following code block, the tokenizer method of the Indo-European factory's singleton INSTANCE is used to create an instance of the Tokenizer class:

Tokenizer tokenizer = IndoEuropeanTokenizerFactory.INSTANCE
    .tokenizer(text.toCharArray(), 0, text.length());

The tokenize method of this class is then used to populate the two lists:

tokenizer.tokenize(tokenList, whiteList);

Use a for-each statement to display the tokens:

for(String element : tokenList) {
  System.out.print(element + " ");
}
System.out.println();

The output of this example is shown here:

A sample sentence processed by the LingPipe tokenizer
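
The whiteList populated earlier records the whitespace surrounding the tokens. Assuming, as LingPipe's tokenize method is expected to do, that the whitespace list ends up with one more entry than the token list, the original text can be rebuilt by interleaving the two lists:

// Rebuild the original text by interleaving whitespace and tokens.
// Assumes whiteList.size() == tokenList.size() + 1.
StringBuilder rebuilt = new StringBuilder();
for (int i = 0; i < tokenList.size(); i++) {
    rebuilt.append(whiteList.get(i)).append(tokenList.get(i));
}
rebuilt.append(whiteList.get(whiteList.size() - 1));
System.out.println(rebuilt.toString().equals(text)); // should print true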

A list of LingPipe links can be found in the following table:

GATE

General Architecture for Text Engineering (GATE) is a set of tools written in Java and developed at the University of Sheffield in England. It supports many NLP tasks and languages. It can also be used as a pipeline for NLP processing.

It provides an API along with GATE Developer, a document viewer that displays text along with annotations. This is useful for examining a document using highlighted annotations. GATE Mimir, a tool for indexing and searching text generated by various sources, is also available. Using GATE for many NLP tasks involves a bit of code; GATE Embedded is used to embed GATE functionality directly in code. Useful GATE links are listed in the following table:

UIMA

The Organization for the Advancement of Structured Information Standards (OASIS) is a consortium focused on information-oriented business technologies. It developed the Unstructured Information Management Architecture (UIMA) standard as a framework for NLP pipelines. The standard is supported by Apache UIMA.

In addition to supporting pipeline creation, UIMA also describes a series of design patterns, data representations, and user roles for the analysis of text. UIMA links are listed in the following table:
