Natural Language Processing with Java

Techniques for building machine learning and neural network models for NLP

Product type: Paperback
Published in: Jul 2018
ISBN-13: 9781788993494
Length: 318 pages
Edition: 2nd Edition
Authors (2): Ashish Bhatia, Richard M. Reese

Preparing data

An important step in NLP is finding and preparing the data for processing. This includes the data for training purposes and the data that needs to be processed. There are several factors that need to be considered. Here, we will focus on the support Java provides for working with characters.

We need to consider how characters are represented. Although we will deal primarily with English text, other languages present unique problems. Not only can characters be encoded differently, but the order in which text is read also varies. For example, Japanese is traditionally written in vertical columns that are read from top to bottom, with the columns ordered from right to left.

Several encodings are also in use, including ASCII, Latin, and Unicode, to mention a few. Unicode, in particular, is a complex and extensive encoding scheme. A more complete list is found in the following table:

| Encoding | Description |
| --- | --- |
| ASCII | A character encoding that uses 128 values (0-127). |
| Latin | There are several Latin variations that use 256 values. They include various combinations of umlauts and other accented characters. Different versions of Latin have been introduced to address various languages, such as Turkish and Esperanto. |
| Big5 | A two-byte encoding for Traditional Chinese characters. |
| Unicode | There are three main encodings for Unicode: UTF-8, UTF-16, and UTF-32, which use code units of 1, 2, and 4 bytes, respectively. UTF-8 and UTF-16 are variable-length (1 to 4 bytes and 2 or 4 bytes per character), while UTF-32 always uses 4 bytes. Unicode aims to represent the writing systems of all languages in use today. Scripts for constructed languages, such as Klingon and Elvish, are not part of the official standard, although unofficial private-use assignments exist. |

Java is capable of handling these encoding schemes. The javac executable's -encoding command-line option specifies the character encoding of the source files being compiled. In the following command line, the Big5 encoding scheme is specified:

javac -encoding Big5
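The -encoding option only affects how source files are read. As a rough illustration of how the encodings in the table differ at runtime, the following sketch counts the bytes that a few sample characters occupy under several charsets. The class name EncodingDemo and the helper encodedLength are made up for this example; String.getBytes and the StandardCharsets constants are part of the JDK. The sample strings use Unicode escapes so that the file compiles the same way whatever -encoding setting is in effect.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Hypothetical helper: number of bytes needed to encode text in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "A", e with acute accent, a CJK ideograph, and an emoji outside the BMP.
        String[] samples = {"A", "\u00e9", "\u4e2d", "\uD83D\uDE00"};

        for (String s : samples) {
            // UTF_16BE is used instead of UTF_16 so that no byte-order mark is counted.
            System.out.printf("U+%04X  UTF-8=%d  UTF-16=%d%n",
                    s.codePointAt(0),
                    encodedLength(s, StandardCharsets.UTF_8),
                    encodedLength(s, StandardCharsets.UTF_16BE));
        }
        // Prints:
        // U+0041  UTF-8=1  UTF-16=2
        // U+00E9  UTF-8=2  UTF-16=2
        // U+4E2D  UTF-8=3  UTF-16=2
        // U+1F600  UTF-8=4  UTF-16=4
    }
}
```

The last sample shows why a single char is not always a whole character: code points above U+FFFF occupy two char values (a surrogate pair) in a Java String.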

Character-processing is supported using the primitive char data type, the Character class, and several other classes and interfaces, as summarized in the following table:

| Character type | Description |
| --- | --- |
| char | Primitive data type. |
| Character | Wrapper class for char. |
| CharBuffer | A buffer of char values, with methods to get and put single characters or sequences of characters. |
| CharSequence | An interface implemented by CharBuffer, Segment, String, StringBuffer, and StringBuilder. It supports read-only access to a sequence of chars. |
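Because String, StringBuilder, and CharBuffer all implement CharSequence, a method written against that interface accepts any of them. The sketch below is one way to show this, using the static classification methods of the Character class. The class name CharSequenceDemo and the method countLetters are illustrative names, not part of any library.

```java
import java.nio.CharBuffer;

public class CharSequenceDemo {

    // Hypothetical helper: counts the letters in any CharSequence implementation.
    static int countLetters(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetter(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The same method works for three different character containers.
        System.out.println(countLetters("NLP 101"));                           // 3
        System.out.println(countLetters(new StringBuilder("Java 8")));         // 4
        System.out.println(countLetters(CharBuffer.wrap("a1b2".toCharArray()))); // 2
    }
}
```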

 

Java also provides a number of classes and interfaces to support strings. We will use these in many of our examples. The String, StringBuffer, and StringBuilder classes provide similar string-processing capabilities but differ in whether they can be modified and whether they are thread-safe. The CharacterIterator interface and the StringCharacterIterator class provide techniques to traverse character sequences. The Segment class represents a fragment of text. These are summarized in the following table:

| Class/interface | Description |
| --- | --- |
| String | An immutable string. |
| StringBuffer | Represents a modifiable string. It is thread-safe. |
| StringBuilder | API-compatible with the StringBuffer class but not thread-safe. |
| Segment | Represents a fragment of text in a character array. It provides rapid access to character data in an array. |
| CharacterIterator | Defines an iterator for text. It supports a bidirectional traversal of text. |
| StringCharacterIterator | A class that implements the CharacterIterator interface for a String. |
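The difference between the immutable String and the modifiable StringBuilder, and the bidirectional traversal offered by CharacterIterator, can be sketched as follows. The class name StringClassesDemo and the method reverse are invented for this illustration; the iteration idiom with last(), previous(), and CharacterIterator.DONE is the one the JDK provides.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class StringClassesDemo {

    // Hypothetical helper: builds a reversed copy by iterating from the last character to the first.
    static String reverse(String text) {
        StringBuilder reversed = new StringBuilder(text.length());
        CharacterIterator it = new StringCharacterIterator(text);
        for (char c = it.last(); c != CharacterIterator.DONE; c = it.previous()) {
            reversed.append(c);
        }
        return reversed.toString();
    }

    public static void main(String[] args) {
        // String is immutable: concat returns a new object and leaves the original alone.
        String fixed = "text";
        String longer = fixed.concat(" data");
        System.out.println(fixed);   // text
        System.out.println(longer);  // text data

        // StringBuilder is modified in place (and is not thread-safe).
        StringBuilder builder = new StringBuilder("text");
        builder.append(" data");
        System.out.println(builder); // text data

        // CharacterIterator supports traversal in both directions.
        System.out.println(reverse("NLP")); // PLN
    }
}
```

In single-threaded code StringBuilder is usually the natural choice; StringBuffer offers the same operations with synchronization.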

 

We also need to consider the file format if we are reading from a file. Often, data is obtained from sources where the words are annotated. For example, if we use a web page as the source of text, we will find that it is marked up with HTML tags. These are not necessarily relevant to the analysis process and may need to be removed.
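As a minimal sketch of removing such markup, the following example strips anything that looks like a tag and collapses the leftover whitespace. The class name MarkupStripper and the method stripTags are hypothetical. This is a deliberately naive approach: it does not decode entities such as &amp;amp;, and it will mishandle scripts, comments, and malformed HTML, so a dedicated HTML parser is the safer choice for real pages.

```java
import java.util.regex.Pattern;

public class MarkupStripper {

    // Naive patterns: anything between '<' and '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: replaces tags with spaces, then collapses runs of whitespace.
    static String stripTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>Some <i>marked up</i> text.</p>";
        System.out.println(stripTags(html)); // Title Some marked up text.
    }
}
```

Tags are replaced with a space rather than an empty string so that words from adjacent elements, such as the heading and the paragraph above, do not run together.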

The Multipurpose Internet Mail Extensions (MIME) type is used to characterize the format used by a file. Common file types are listed in the following table. We need to either explicitly remove or alter the markup found in a file or use specialized software to deal with it. Some of the NLP APIs provide tools to deal with specialized file formats:

| File format | MIME type | Description |
| --- | --- | --- |
| Text | text/plain | Simple text file |
| Office type document | application/msword | Microsoft Office |
| Office type document | application/vnd.oasis.opendocument.text | Open Office |
| PDF | application/pdf | Adobe Portable Document Format |
| HTML | text/html | Web pages |
| XML | text/xml | eXtensible Markup Language |
| Database | Not applicable | Data can be in a number of different formats |

Many of the NLP APIs assume that the data is clean. When it is not, it needs to be cleaned, lest we get unreliable and misleading results.
