Survey of NLP tools

There are many tools available that support NLP. Some of these are available with the Java SE SDK but are limited in their utility for all but the simplest types of problems. Other libraries, such as Apache OpenNLP and LingPipe, provide extensive and sophisticated support for NLP problems.

Low-level Java support includes string-handling classes, such as String, StringBuilder, and StringBuffer. These classes possess methods that perform searching, matching, and text replacement. Regular expressions use special encodings to match substrings, and Java provides a rich set of classes for working with them.
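
The following is a minimal sketch of this low-level support (the sample text and the <year> placeholder are our own), using the String class's replaceAll method and the java.util.regex package:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StringSupportDemo {
    public static void main(String[] args) {
        String text = "NLP with Java was published in 2018.";

        // Search and replace using the String class
        String masked = text.replaceAll("\\d{4}", "<year>");
        System.out.println(masked); // NLP with Java was published in <year>.

        // A regular expression that matches runs of letters
        Matcher matcher = Pattern.compile("[A-Za-z]+").matcher(text);
        while (matcher.find()) {
            System.out.print("[" + matcher.group() + "] ");
        }
        System.out.println();
    }
}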

As discussed earlier, tokenizers are used to split text into individual elements. Java provides support for tokenization with the following (a sketch of all three approaches appears after this list):

  • The String class' split method
  • The StreamTokenizer class
  • The StringTokenizer class
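
Here is a minimal sketch of these three approaches, reusing the book's running sample sentence; note that StreamTokenizer handles punctuation and numbers differently from the other two:

import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.StringTokenizer;

public class CoreTokenizationDemo {
    public static void main(String[] args) throws IOException {
        String text = "He lives at 1511 W. Randolph.";

        // 1. The String class's split method, splitting on whitespace
        for (String token : text.split("\\s+")) {
            System.out.print("[" + token + "] ");
        }
        System.out.println();

        // 2. The StringTokenizer class, which also splits on whitespace by default
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            System.out.print("[" + st.nextToken() + "] ");
        }
        System.out.println();

        // 3. The StreamTokenizer class, which classifies tokens as
        //    words (sval) or numbers (nval, a double) as it reads
        StreamTokenizer streamTokenizer = new StreamTokenizer(new StringReader(text));
        while (streamTokenizer.nextToken() != StreamTokenizer.TT_EOF) {
            if (streamTokenizer.ttype == StreamTokenizer.TT_WORD) {
                System.out.print("[" + streamTokenizer.sval + "] ");
            } else if (streamTokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                System.out.print("[" + streamTokenizer.nval + "] ");
            }
        }
        System.out.println();
    }
}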

A number of NLP libraries/APIs also exist for Java. A partial list of Java-based NLP APIs follows. Most of these are open source; in addition, a number of commercial APIs are available. We will focus on the open source APIs:

Apertium: http://www.apertium.org/
General Architecture for Text Engineering: http://gate.ac.uk/
Learning Based Java: https://github.com/CogComp/lbjava
LingPipe: http://alias-i.com/lingpipe/
MALLET: http://mallet.cs.umass.edu/
MontyLingua: http://web.media.mit.edu/~hugo/montylingua/
Apache OpenNLP: http://opennlp.apache.org/
UIMA: http://uima.apache.org/
Stanford Parser: http://nlp.stanford.edu/software
Apache Lucene Core: https://lucene.apache.org/core/
Snowball: http://snowballstem.org/

Many of these NLP tasks are combined to form a pipeline. A pipeline consists of various NLP tasks, which are integrated into a series of steps to achieve a processing goal. Examples of frameworks that support pipelines are General Architecture for Text Engineering (GATE) and Apache UIMA.

In the next section, we will cover several NLP APIs in more depth. A brief overview of their capabilities will be presented along with a list of useful links for each API.

Apache OpenNLP

The Apache OpenNLP project is a machine-learning-based toolkit for processing natural-language text; it addresses common NLP tasks and will be used throughout this book. It consists of several components that perform specific tasks, permit models to be trained, and support testing of those models. The general approach used by OpenNLP is to instantiate, from a file, a model that supports the task, and then execute methods against that model to perform the task.

For example, in the following sequence, we will tokenize a simple string. For this code to execute properly, it must handle the FileNotFoundException and IOException exceptions. We use a try-with-resources block to open a FileInputStream instance on the en-token.bin file. This file contains a model that has been trained on English text:

try (InputStream is = new FileInputStream(
        new File(getModelDir(), "en-token.bin"))) {
    // Insert code to tokenize the text
} catch (FileNotFoundException ex) {
    // ...
} catch (IOException ex) {
    // ...
}

An instance of the TokenizerModel class is then created using this file inside the try block. Next, we create an instance of the Tokenizer class, as shown here:

TokenizerModel model = new TokenizerModel(is); 
Tokenizer tokenizer = new TokenizerME(model); 

The tokenize method is then applied; its argument is the text to be tokenized. The method returns an array of String objects:

String[] tokens = tokenizer.tokenize("He lives at 1511 W. "
    + "Randolph.");

A for-each statement displays the tokens, as shown here. The opening and closing brackets are used to clearly delimit the tokens:

for (String a : tokens) { 
  System.out.print("[" + a + "] "); 
} 
System.out.println(); 

When we execute this, we will get the following output:

[He] [lives] [at] [1511] [W.] [Randolph] [.]  

In this case, the tokenizer recognized that W. was an abbreviation and that the last period was a separate token marking the end of the sentence.
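
As an aside, the Tokenizer interface also defines a tokenizePos method, which returns the character offsets of the tokens as Span objects (from the opennlp.tools.util package) rather than strings. A minimal sketch, reusing the tokenizer created above:

String sentence = "He lives at 1511 W. Randolph.";
// Each Span records the start and end character offsets of one token
Span[] spans = tokenizer.tokenizePos(sentence);
for (Span span : spans) {
    System.out.println(span + " -> " + span.getCoveredText(sentence));
}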

We will use the OpenNLP API for many of the examples in this book. Useful OpenNLP links are listed here:

Home: https://opennlp.apache.org/
Documentation: https://opennlp.apache.org/docs/
Javadoc: https://opennlp.apache.org/docs/ (the API documentation for each release is linked from this page)
Download: https://opennlp.apache.org/cgi-bin/download.cgi
Wiki: https://cwiki.apache.org/confluence/display/OPENNLP/Index

Stanford NLP

The Stanford NLP Group conducts NLP research and provides tools for NLP tasks. Stanford CoreNLP is one of these toolsets; others include the Stanford Parser, the Stanford POS Tagger, and the Stanford Classifier. The Stanford tools support English and Chinese and cover basic NLP tasks, including tokenization and named-entity recognition.

These tools are released under the full GPL, which does not permit their use in commercial applications, though a commercial license is available. The API is well organized and supports the core NLP functionality.

Several tokenization approaches are supported by the Stanford group. We will use the PTBTokenizer class to illustrate this NLP library. The constructor demonstrated here takes a Reader object, a LexedTokenFactory<T> argument, and a string that specifies which of several options is to be used.

LexedTokenFactory is an interface that is implemented by the CoreLabelTokenFactory and WordTokenFactory classes. The former class supports the retention of the beginning and ending character positions of a token, whereas the latter class simply returns a token as a string without any positional information. The WordTokenFactory class is used by default.

The CoreLabelTokenFactory class is used in the following example. A StringReader is created using a string. The last argument is used for the option parameter, which is null for this example. The Iterator interface is implemented by the PTBTokenizer class, allowing us to use the hasNext and next methods to display the tokens:

PTBTokenizer ptb = new PTBTokenizer(
    new StringReader("He lives at 1511 W. Randolph."),
    new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
  System.out.println(ptb.next());
}

The output is as follows:

He
lives
at
1511
W.
Randolph
.  
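
Because CoreLabelTokenFactory was used, each token returned by the next method is actually a CoreLabel, so the positional information mentioned earlier can be recovered. A minimal sketch of this, as our own extension of the preceding example:

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import java.io.StringReader;

PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader("He lives at 1511 W. Randolph."),
    new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
  CoreLabel label = ptb.next();
  // beginPosition and endPosition are the character offsets
  // retained by CoreLabelTokenFactory
  System.out.println(label.word() + " [" + label.beginPosition()
      + ", " + label.endPosition() + ")");
}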

We will use the Stanford NLP library extensively in this book. Useful Stanford links are listed here; documentation and download links are found in each of the distributions:

Home: http://nlp.stanford.edu/index.shtml
CoreNLP: http://nlp.stanford.edu/software/corenlp.shtml#Download
Parser: http://nlp.stanford.edu/software/lex-parser.shtml
POS Tagger: http://nlp.stanford.edu/software/tagger.shtml
java-nlp-user mailing list: https://mailman.stanford.edu/mailman/listinfo/java-nlp-user

LingPipe

LingPipe consists of a set of tools for performing common NLP tasks. It supports model training and testing. Both royalty-free and licensed versions of the tool are available; production use of the free version is limited.

To demonstrate the use of LingPipe, we will illustrate how it can be used to tokenize text using the Tokenizer class. Start by declaring two lists, one to hold the tokens and a second to hold the whitespace:

List<String> tokenList = new ArrayList<>(); 
List<String> whiteList = new ArrayList<>(); 

Next, declare a string to hold the text to be tokenized:

String text = "A sample sentence processed \nby \tthe " + 
    "LingPipe tokenizer."; 

Now, create an instance of the Tokenizer class. As shown in the following code block, the tokenizer method of the IndoEuropeanTokenizerFactory singleton (its static INSTANCE field) is used to create the Tokenizer instance:

Tokenizer tokenizer = IndoEuropeanTokenizerFactory.INSTANCE
    .tokenizer(text.toCharArray(), 0, text.length());

The tokenize method of this class is then used to populate the two lists:

tokenizer.tokenize(tokenList, whiteList); 

Use a for-each statement to display the tokens:

for(String element : tokenList) { 
  System.out.print(element + " "); 
} 
System.out.println(); 

The output of this example is shown here:

A sample sentence processed by the LingPipe tokenizer
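
Note that the whiteList populated earlier holds the whitespace between the tokens. Assuming LingPipe's convention that the whitespace list contains one more entry than the token list, the original text can be reconstructed by interleaving the two lists; a brief sketch:

// whiteList holds n+1 whitespace strings for the n tokens in tokenList
StringBuilder rebuilt = new StringBuilder(whiteList.get(0));
for (int i = 0; i < tokenList.size(); i++) {
    rebuilt.append(tokenList.get(i)).append(whiteList.get(i + 1));
}
System.out.println(rebuilt); // prints the original text, whitespace included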

A list of LingPipe links follows:

Home: http://alias-i.com/lingpipe/index.html
Tutorials: http://alias-i.com/lingpipe/demos/tutorial/read-me.html
JavaDocs: http://alias-i.com/lingpipe/docs/api/index.html
Download: http://alias-i.com/lingpipe/web/install.html
Core: http://alias-i.com/lingpipe/web/download.html
Models: http://alias-i.com/lingpipe/web/models.html

GATE

GATE is a set of tools written in Java and developed at the University of Sheffield in England. It supports many NLP tasks and languages, and it can also be used as an NLP processing pipeline. It provides an API along with GATE Developer, a document viewer that displays text together with its annotations, which is useful for examining a document using highlighted annotations. GATE Mimir, a tool for indexing and searching text generated by various sources, is also available. Using GATE for many NLP tasks involves a bit of code; GATE Embedded is used to embed GATE functionality directly in code (a brief sketch follows the links below). Useful GATE links are listed here:

Home: https://gate.ac.uk/
Documentation: https://gate.ac.uk/documentation.html
JavaDocs: http://jenkins.gate.ac.uk/job/GATE-Nightly/javadoc/
Download: https://gate.ac.uk/download/
Wiki: http://gatewiki.sf.net/
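
To give a flavor of GATE Embedded, here is a minimal sketch that creates a document from a string. It assumes a locally installed and configured GATE; Gate.init and Factory.newDocument are the standard entry points of the GATE Embedded API:

import gate.Document;
import gate.Factory;
import gate.Gate;

public class GateEmbeddedDemo {
    public static void main(String[] args) throws Exception {
        // Initialize the GATE library; assumes GATE's home directory is
        // configured, for example via the gate.home system property
        Gate.init();

        // Create a GATE document from a plain string and print its content
        Document doc = Factory.newDocument("He lives at 1511 W. Randolph.");
        System.out.println(doc.getContent());

        // GATE resources must be released explicitly
        Factory.deleteResource(doc);
    }
}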

 

TwitIE is an open source GATE pipeline for information extraction over tweets. It contains the following:

  • Language identification for social media data
  • A Twitter tokenizer that handles smileys, usernames, URLs, and so on
  • A POS tagger
  • Text normalization

It is available as part of the GATE Twitter plugin. The relevant links are listed here:

Home: https://gate.ac.uk/wiki/twitie.html
Documentation: https://gate.ac.uk/sale/ranlp2013/twitie/twitie-ranlp2013.pdf?m=1

UIMA

The Organization for the Advancement of Structured Information Standards (OASIS) is a consortium focused on information-oriented business technologies. It developed the Unstructured Information Management Architecture (UIMA) standard as a framework for NLP pipelines; the standard is implemented by Apache UIMA.

In addition to supporting pipeline creation, UIMA describes a series of design patterns, data representations, and user roles for the analysis of text. UIMA links are listed here:

Home: https://uima.apache.org/
Documentation: https://uima.apache.org/documentation.html
JavaDocs: https://uima.apache.org/d/uimaj-2.6.0/apidocs/index.html
Download: https://uima.apache.org/downloads.cgi
Wiki: https://cwiki.apache.org/confluence/display/UIMA/Index

Apache Lucene Core

Apache Lucene Core is an open source library for full-featured text search engines, written in Java. It uses tokenization to break text into the small chunks that are indexed, and it provides both pre- and post-tokenization options for analysis: stemming, filtering, text normalization, and synonym expansion can all be applied after tokenization. When used, it creates a directory of index files whose contents can then be searched. It should not be regarded as an NLP toolkit, but it provides powerful tools for working with text and for advanced string manipulation built on tokenization.
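
To make the analysis step concrete, here is a minimal sketch that tokenizes our running example with Lucene's StandardAnalyzer; the field name contents is an arbitrary choice of our own:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LuceneAnalysisDemo {
    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            TokenStream ts = analyzer.tokenStream("contents",
                "He lives at 1511 W. Randolph.");
            // The attribute exposes the text of the current token
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // StandardAnalyzer lowercases tokens and strips most punctuation
                System.out.print("[" + term + "] ");
            }
            ts.end();
            ts.close();
            System.out.println();
        }
    }
}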

Important Apache Lucene links are listed here:

Home: http://lucene.apache.org/
Documentation: http://lucene.apache.org/core/documentation.html
JavaDocs: http://lucene.apache.org/core/7_3_0/core/index.html
Download: http://lucene.apache.org/core/mirrors-core-latest-redir.html
