Natural Language Processing with Java: Explore various approaches to organize and extract useful text from unstructured data using Java

Richard M. Reese

4.6 (7 Ratings)

Paperback Mar 2015 262 pages 1st Edition

eBook

Can$12.99 ~~Can$49.99~~

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!

Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!

50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.

Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.

Thousands of reference materials covering every tech concept you need to stay up to date.

Natural Language Processing with Java

Chapter 2. Finding Parts of Text

Finding parts of text is concerned with breaking text down into individual units called tokens, and optionally performing additional processing on these tokens. This additional processing can include stemming, lemmatization, stopword removal, synonym expansion, and converting text to lowercase.
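As a rough illustration of the steps named above, the following sketch splits a string on whitespace, converts each token to lowercase, and drops stopwords. The class name `TokenPipelineSketch` and the tiny stopword set are hypothetical, chosen only for this example; stemming, lemmatization, and synonym expansion are not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TokenPipelineSketch {
    // Hypothetical, deliberately tiny stopword list for illustration only.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "is", "to"));

    // Splits on runs of whitespace, lowercases each token, and removes stopwords.
    static List<String> process(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            // An empty input yields one empty string from split, so skip it.
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [tokenizer, part, pipeline]
        System.out.println(process("The tokenizer is a part of the pipeline"));
    }
}
```

A real application would likely load a fuller stopword list and decide whether lowercasing should happen before or after other steps, since the order can change the result.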

We will demonstrate several tokenization techniques found in the standard Java distribution. They are included because they are sometimes all you need to do the job, in which case there is no need to import an NLP library. However, these techniques are limited. We then discuss specific tokenizers and tokenization approaches supported by NLP APIs; these examples provide a reference for how the tokenizers are used and the type of output they produce. A simple comparison of the approaches follows.
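A minimal sketch of one standard-library option, assuming `java.util.Scanner` with its default whitespace delimiter; the class name `ScannerTokenizerSketch` is hypothetical. It also hints at the kind of limitation such techniques have: punctuation stays attached to the neighboring word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerTokenizerSketch {
    // Collects the tokens that Scanner produces with its default
    // delimiter, which is whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Let's, pause,, and, then, reflect.]
        // Note that "pause," and "reflect." keep their punctuation.
        System.out.println(tokenize("Let's pause, and then reflect."));
    }
}
```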

There are many specialized tokenizers. For example, the Apache Lucene project supports tokenizers...

What is tokenization?

Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Text is split into tokens based on a set of delimiters, which are frequently whitespace characters. Whitespace in Java is defined by the Character class's isWhitespace method, and the characters it recognizes are listed in the following table. At times, however, a different set of delimiters is needed. For example, different delimiters can be useful when whitespace delimiters obscure text breaks, such as paragraph boundaries, and detecting these breaks is important.

Character	Meaning
Unicode space character	(space_separator, line_separator, or paragraph_separator)
`\t`	U+0009 horizontal tabulation
`\n`	U+000A line feed
`\u000B`	U+000B vertical tabulation
`\f`	U+000C form feed
`\r`	U+000D carriage return
`\u001C`	U+001C file separator
`\u001D`	U+001D group separator
`\u001E`	U+001E record separator
`\u001F`	U+001F...
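The sketch below is one way to see these delimiters in action, under stated assumptions. The first method splits text wherever `Character.isWhitespace` returns true; note that this method does not treat a non-breaking space such as U+00A0 as whitespace. The second method first splits on blank lines so that paragraph boundaries are not lost; it assumes paragraphs are separated by a blank line using `\n` line endings. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceDelimiterSketch {
    // Splits text into tokens wherever Character.isWhitespace reports a delimiter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // A delimiter ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Splits on blank lines first (assumed paragraph separator), then
    // tokenizes each paragraph, so paragraph boundaries are preserved.
    static List<List<String>> tokenizeByParagraph(String text) {
        List<List<String>> paragraphs = new ArrayList<>();
        for (String paragraph : text.split("\\n\\s*\\n")) {
            List<String> tokens = tokenize(paragraph);
            if (!tokens.isEmpty()) {
                paragraphs.add(tokens);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "First paragraph,\ttab separated.\n\nSecond paragraph.";
        // prints [First, paragraph,, tab, separated., Second, paragraph.]
        System.out.println(tokenize(text));
        // prints [[First, paragraph,, tab, separated.], [Second, paragraph.]]
        System.out.println(tokenizeByParagraph(text));
    }
}
```

With plain whitespace splitting the paragraph break disappears into the token stream, whereas the two-level result keeps it, which is the situation where a different delimiter set becomes useful.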

Description

If you are a Java programmer who wants to learn about the fundamental tasks underlying natural language processing, this book is for you. You will be able to identify and use NLP tasks for many common problems, and integrate them into your applications to solve more difficult problems. Readers should be familiar with Java software development.

What you will learn

Develop a deep understanding of the basic NLP tasks and how they relate to each other
Discover and use the available tokenization engines
Implement techniques for end of sentence detection
Apply search techniques to find people and things within a document
Construct solutions to identify parts of speech within sentences
Use parsers to extract relationships between elements of a document
Integrate basic tasks to tackle more complex NLP problems

Customer reviews

Ivan Zaitsev May 28, 2015

A book about deep text processing using tools such as OpenNLP, LingPipe, Stanford NLP, and others, from searching and tokenizing to classifying and extracting relationships. I usually use the standard JDK tools, but these libraries are far more sophisticated and deserve attention.

Amazon Verified review

Amazon Customer May 31, 2015

It is a great book to start to learn programming NLP systems. I started with little experience with Java and NLP in general, but I did learn many things from this book: the basics of word and sentence tokenization, text classification and sentiment analysis, information extraction, parsing, meaning extraction, and question answering. This book will put would-be NLP programmers on their way to use the book’s code as the basis for their own software. Sometimes you won't get some things when you first read, but if you try the examples you will start to get what he is saying. One downside of the book is that some examples are focused on a single API (LingPipe, Apache OpenNLP, or Stanford Parser), but I think they did that so you can think on your feet. From my point of view, this book is worth every penny if you are a programmer and you have to deal with software that needs to do some NLP.

Danijel K. Nov 05, 2016

Good book with a lot of concrete real code examples.

Jon Borgman May 28, 2015

Definitely a great resource for starting out with parsing text. The parts on tokenizers are very well done. Specifically, the training of models and then using them was very helpful. What surprised me was the attention to the types of text parsed, like what you do when multiple acronyms and text speech are used. It does an amazing job of taking someone who is new to the subject and filling up their toolbox with tools and concepts that really give a good picture of what to do and how to proceed.

Stephen D. Williams May 27, 2015

Provides a great introduction to NLP principles, problems, and related Java NLP libraries, with clear, concise example source code. This is a good way to get started with building NLP enabled applications using practical methods and clear code. To this, I would add example code coverage of the Mallet library, some summary of pros and cons of each library, mention of semantic web & graph databases with the presidents question answering example using DBPedia, and pointers to more expansive and ambitious examples.
