Survey of NLP tools

There are many tools available that support NLP. Some of these are available with the Java SE SDK but are limited in their utility for all but the simplest types of problems. Other libraries, such as Apache OpenNLP and LingPipe, provide extensive and sophisticated support for NLP problems.

Low-level Java support includes string-handling classes, such as String, StringBuilder, and StringBuffer. These classes possess methods that perform searching, matching, and text replacement. Regular expressions use special encodings to match substrings, and Java provides a rich set of classes for working with them.
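
The following is a minimal sketch of this low-level support (the sample text and the <year> placeholder are our own), using the String class's replaceAll method and the java.util.regex package:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StringSupportDemo {
    public static void main(String[] args) {
        String text = "NLP with Java was published in 2018.";

        // Search and replace using the String class
        String masked = text.replaceAll("\\d{4}", "<year>");
        System.out.println(masked); // NLP with Java was published in <year>.

        // A regular expression that matches runs of letters
        Matcher matcher = Pattern.compile("[A-Za-z]+").matcher(text);
        while (matcher.find()) {
            System.out.print("[" + matcher.group() + "] ");
        }
        System.out.println();
    }
}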

As discussed earlier, tokenizers are used to split text into individual elements. Java provides support for tokenization with the following (a sketch of all three approaches appears after this list):

  • The String class' split method
  • The StreamTokenizer class
  • The StringTokenizer class
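
Here is a minimal sketch of these three approaches, reusing the book's running sample sentence; note that StreamTokenizer handles punctuation and numbers differently from the other two:

import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.StringTokenizer;

public class CoreTokenizationDemo {
    public static void main(String[] args) throws IOException {
        String text = "He lives at 1511 W. Randolph.";

        // 1. The String class's split method, splitting on whitespace
        for (String token : text.split("\\s+")) {
            System.out.print("[" + token + "] ");
        }
        System.out.println();

        // 2. The StringTokenizer class, which also splits on whitespace by default
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            System.out.print("[" + st.nextToken() + "] ");
        }
        System.out.println();

        // 3. The StreamTokenizer class, which classifies tokens as
        //    words (sval) or numbers (nval, a double) as it reads
        StreamTokenizer streamTokenizer = new StreamTokenizer(new StringReader(text));
        while (streamTokenizer.nextToken() != StreamTokenizer.TT_EOF) {
            if (streamTokenizer.ttype == StreamTokenizer.TT_WORD) {
                System.out.print("[" + streamTokenizer.sval + "] ");
            } else if (streamTokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                System.out.print("[" + streamTokenizer.nval + "] ");
            }
        }
        System.out.println();
    }
}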

A number of NLP libraries/APIs also exist for Java. A partial list of Java-based NLP APIs follows. Most of these are open source; in addition, a number of commercial APIs are available. We will focus on the open source APIs:

Apertium: http://www.apertium.org/
General Architecture for Text Engineering: http://gate.ac.uk/
Learning Based Java: https://github.com/CogComp/lbjava
LingPipe: http://alias-i.com/lingpipe/
MALLET: http://mallet.cs.umass.edu/
MontyLingua: http://web.media.mit.edu/~hugo/montylingua/
Apache OpenNLP: http://opennlp.apache.org/
UIMA: http://uima.apache.org/
Stanford Parser: http://nlp.stanford.edu/software
Apache Lucene Core: https://lucene.apache.org/core/
Snowball: http://snowballstem.org/

Many of these NLP tasks are combined to form a pipeline. A pipeline consists of various NLP tasks, which are integrated into a series of steps to achieve a processing goal. Examples of frameworks that support pipelines are General Architecture for Text Engineering (GATE) and Apache UIMA.

In the next section, we will cover several NLP APIs in more depth. A brief overview of their capabilities will be presented along with a list of useful links for each API.

Apache OpenNLP

The Apache OpenNLP project is a machine-learning-based toolkit for processing natural-language text; it addresses common NLP tasks and will be used throughout this book. It consists of several components that perform specific tasks, permit models to be trained, and support testing of those models. The general approach used by OpenNLP is to instantiate, from a file, a model that supports the task, and then execute methods against that model to perform the task.

For example, in the following sequence, we will tokenize a simple string. For this code to execute properly, it must handle the FileNotFoundException and IOException exceptions. We use a try-with-resources block to open a FileInputStream instance on the en-token.bin file. This file contains a model that has been trained on English text:

try (InputStream is = new FileInputStream(
        new File(getModelDir(), "en-token.bin"))) {
    // Insert code to tokenize the text
} catch (FileNotFoundException ex) {
    // ...
} catch (IOException ex) {
    // ...
}

An instance of the TokenizerModel class is then created using this file inside the try block. Next, we create an instance of the Tokenizer class, as shown here:

TokenizerModel model = new TokenizerModel(is); 
Tokenizer tokenizer = new TokenizerME(model); 

The tokenize method is then applied; its argument is the text to be tokenized. The method returns an array of String objects:

String[] tokens = tokenizer.tokenize("He lives at 1511 W. "
    + "Randolph.");

A for-each statement displays the tokens, as shown here. The opening and closing brackets are used to clearly delimit the tokens:

for (String a : tokens) { 
  System.out.print("[" + a + "] "); 
} 
System.out.println(); 

When we execute this, we will get the following output:

[He] [lives] [at] [1511] [W.] [Randolph] [.]  

In this case, the tokenizer recognized that W. was an abbreviation and that the last period was a separate token marking the end of the sentence.
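
As an aside, the Tokenizer interface also defines a tokenizePos method, which returns the character offsets of the tokens as Span objects (from the opennlp.tools.util package) rather than strings. A minimal sketch, reusing the tokenizer created above:

String sentence = "He lives at 1511 W. Randolph.";
// Each Span records the start and end character offsets of one token
Span[] spans = tokenizer.tokenizePos(sentence);
for (Span span : spans) {
    System.out.println(span + " -> " + span.getCoveredText(sentence));
}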

We will use the OpenNLP API for many of the examples in this book. Useful OpenNLP links are listed here:

Home: https://opennlp.apache.org/
Documentation: https://opennlp.apache.org/docs/
Javadoc: https://opennlp.apache.org/docs/ (the API documentation for each release is linked from this page)
Download: https://opennlp.apache.org/cgi-bin/download.cgi
Wiki: https://cwiki.apache.org/confluence/display/OPENNLP/Index

Stanford NLP

The Stanford NLP Group conducts NLP research and provides tools for NLP tasks. Stanford CoreNLP is one of these toolsets; others include the Stanford Parser, the Stanford POS Tagger, and the Stanford Classifier. The Stanford tools support English and Chinese and cover basic NLP tasks, including tokenization and named-entity recognition.

These tools are released under the full GPL, which does not permit their use in commercial applications, though a commercial license is available. The API is well organized and supports the core NLP functionality.

Several tokenization approaches are supported by the Stanford group. We will use the PTBTokenizer class to illustrate this NLP library. The constructor demonstrated here takes a Reader object, a LexedTokenFactory<T> argument, and a string that specifies which of several options is to be used.

LexedTokenFactory is an interface that is implemented by the CoreLabelTokenFactory and WordTokenFactory classes. The former class supports the retention of the beginning and ending character positions of a token, whereas the latter class simply returns a token as a string without any positional information. The WordTokenFactory class is used by default.

The CoreLabelTokenFactory class is used in the following example. A StringReader is created using a string. The last argument is used for the option parameter, which is null for this example. The Iterator interface is implemented by the PTBTokenizer class, allowing us to use the hasNext and next methods to display the tokens:

PTBTokenizer ptb = new PTBTokenizer(
    new StringReader("He lives at 1511 W. Randolph."),
    new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
  System.out.println(ptb.next());
}

The output is as follows:

He
lives
at
1511
W.
Randolph
.  
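
Because CoreLabelTokenFactory was used, each token returned by the next method is actually a CoreLabel, so the positional information mentioned earlier can be recovered. A minimal sketch of this, as our own extension of the preceding example:

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import java.io.StringReader;

PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader("He lives at 1511 W. Randolph."),
    new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
  CoreLabel label = ptb.next();
  // beginPosition and endPosition are the character offsets
  // retained by CoreLabelTokenFactory
  System.out.println(label.word() + " [" + label.beginPosition()
      + ", " + label.endPosition() + ")");
}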

We will use the Stanford NLP library extensively in this book. Useful Stanford links are listed here; documentation and download links are found in each of the distributions:

Home: http://nlp.stanford.edu/index.shtml
CoreNLP: http://nlp.stanford.edu/software/corenlp.shtml#Download
Parser: http://nlp.stanford.edu/software/lex-parser.shtml
POS Tagger: http://nlp.stanford.edu/software/tagger.shtml
java-nlp-user mailing list: https://mailman.stanford.edu/mailman/listinfo/java-nlp-user

LingPipe

LingPipe consists of a set of tools for performing common NLP tasks. It supports model training and testing. Both royalty-free and licensed versions of the tool are available; production use of the free version is limited.

To demonstrate the use of LingPipe, we will illustrate how it can be used to tokenize text using the Tokenizer class. Start by declaring two lists, one to hold the tokens and a second to hold the whitespace:

List<String> tokenList = new ArrayList<>(); 
List<String> whiteList = new ArrayList<>(); 

Next, declare a string to hold the text to be tokenized:

String text = "A sample sentence processed \nby \tthe " + 
    "LingPipe tokenizer."; 

Now, create an instance of the Tokenizer class. As shown in the following code block, the tokenizer method of the IndoEuropeanTokenizerFactory singleton (its static INSTANCE field) is used to create the Tokenizer instance:

Tokenizer tokenizer = IndoEuropeanTokenizerFactory.INSTANCE
    .tokenizer(text.toCharArray(), 0, text.length());

The tokenize method of this class is then used to populate the two lists:

tokenizer.tokenize(tokenList, whiteList); 

Use a for-each statement to display the tokens:

for(String element : tokenList) { 
  System.out.print(element + " "); 
} 
System.out.println(); 

The output of this example is shown here:

A sample sentence processed by the LingPipe tokenizer
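
Note that the whiteList populated earlier holds the whitespace between the tokens. Assuming LingPipe's convention that the whitespace list contains one more entry than the token list, the original text can be reconstructed by interleaving the two lists; a brief sketch:

// whiteList holds n+1 whitespace strings for the n tokens in tokenList
StringBuilder rebuilt = new StringBuilder(whiteList.get(0));
for (int i = 0; i < tokenList.size(); i++) {
    rebuilt.append(tokenList.get(i)).append(whiteList.get(i + 1));
}
System.out.println(rebuilt); // prints the original text, whitespace included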

A list of LingPipe links follows:

Home: http://alias-i.com/lingpipe/index.html
Tutorials: http://alias-i.com/lingpipe/demos/tutorial/read-me.html
JavaDocs: http://alias-i.com/lingpipe/docs/api/index.html
Download: http://alias-i.com/lingpipe/web/install.html
Core: http://alias-i.com/lingpipe/web/download.html
Models: http://alias-i.com/lingpipe/web/models.html

GATE

GATE is a set of tools written in Java and developed at the University of Sheffield in England. It supports many NLP tasks and languages, and it can also be used as an NLP processing pipeline. It provides an API along with GATE Developer, a document viewer that displays text together with its annotations, which is useful for examining a document using highlighted annotations. GATE Mimir, a tool for indexing and searching text generated by various sources, is also available. Using GATE for many NLP tasks involves a bit of code; GATE Embedded is used to embed GATE functionality directly in code (a brief sketch follows the links below). Useful GATE links are listed here:

Home: https://gate.ac.uk/
Documentation: https://gate.ac.uk/documentation.html
JavaDocs: http://jenkins.gate.ac.uk/job/GATE-Nightly/javadoc/
Download: https://gate.ac.uk/download/
Wiki: http://gatewiki.sf.net/
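
To give a flavor of GATE Embedded, here is a minimal sketch that creates a document from a string. It assumes a locally installed and configured GATE; Gate.init and Factory.newDocument are the standard entry points of the GATE Embedded API:

import gate.Document;
import gate.Factory;
import gate.Gate;

public class GateEmbeddedDemo {
    public static void main(String[] args) throws Exception {
        // Initialize the GATE library; assumes GATE's home directory is
        // configured, for example via the gate.home system property
        Gate.init();

        // Create a GATE document from a plain string and print its content
        Document doc = Factory.newDocument("He lives at 1511 W. Randolph.");
        System.out.println(doc.getContent());

        // GATE resources must be released explicitly
        Factory.deleteResource(doc);
    }
}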

 

TwitIE is an open source GATE pipeline for information extraction over tweets. It contains the following:

  • Language identification for social media data
  • A Twitter tokenizer that handles smileys, usernames, URLs, and so on
  • A POS tagger
  • Text normalization

It is available as part of the GATE Twitter plugin. The relevant links are listed here:

Home: https://gate.ac.uk/wiki/twitie.html
Documentation: https://gate.ac.uk/sale/ranlp2013/twitie/twitie-ranlp2013.pdf?m=1

UIMA

The Organization for the Advancement of Structured Information Standards (OASIS) is a consortium focused on information-oriented business technologies. It developed the Unstructured Information Management Architecture (UIMA) standard as a framework for NLP pipelines; the standard is implemented by Apache UIMA.

In addition to supporting pipeline creation, UIMA describes a series of design patterns, data representations, and user roles for the analysis of text. UIMA links are listed here:

Home: https://uima.apache.org/
Documentation: https://uima.apache.org/documentation.html
JavaDocs: https://uima.apache.org/d/uimaj-2.6.0/apidocs/index.html
Download: https://uima.apache.org/downloads.cgi
Wiki: https://cwiki.apache.org/confluence/display/UIMA/Index

Apache Lucene Core

Apache Lucene Core is an open source library for full-featured text search engines, written in Java. It uses tokenization to break text into the small chunks that are indexed, and it provides both pre- and post-tokenization options for analysis: stemming, filtering, text normalization, and synonym expansion can all be applied after tokenization. When used, it creates a directory of index files whose contents can then be searched. It should not be regarded as an NLP toolkit, but it provides powerful tools for working with text and for advanced string manipulation built on tokenization.
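
To make the analysis step concrete, here is a minimal sketch that tokenizes our running example with Lucene's StandardAnalyzer; the field name contents is an arbitrary choice of our own:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LuceneAnalysisDemo {
    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            TokenStream ts = analyzer.tokenStream("contents",
                "He lives at 1511 W. Randolph.");
            // The attribute exposes the text of the current token
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // StandardAnalyzer lowercases tokens and strips most punctuation
                System.out.print("[" + term + "] ");
            }
            ts.end();
            ts.close();
            System.out.println();
        }
    }
}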

Important Apache Lucene links are listed here:

Home: http://lucene.apache.org/
Documentation: http://lucene.apache.org/core/documentation.html
JavaDocs: http://lucene.apache.org/core/7_3_0/core/index.html
Download: http://lucene.apache.org/core/mirrors-core-latest-redir.html
