Chapter 1. Introduction to NLP
Natural Language Processing (NLP) is a broad topic focused on the use of computers to analyze natural languages. It addresses areas such as speech processing, relationship extraction, document categorization, and summation of text. However, these types of analysis are based on a set of fundamental techniques such as tokenization, sentence detection, classification, and extracting relationships. These basic techniques are the focus of this book. We will start with a detailed discussion of NLP, investigate why it is important, and identify application areas.
There are many tools available that support NLP tasks. We will focus on the Java language and how various Java Application Programmer Interfaces (APIs) support NLP. In this chapter, we will briefly identify the major APIs, including Apache's OpenNLP, Stanford NLP libraries, LingPipe, and GATE.
This is followed by a discussion of the basic NLP techniques illustrated in this book. The nature and use of these techniques is presented and illustrated using one of the NLP APIs. Many of these techniques will use models. Models are similar to a set of rules that are used to perform a task such as tokenizing text. They are typically represented by a class that is instantiated from a file. We round off the chapter with a brief discussion on how data can be prepared to support NLP tasks.
NLP is not easy. While some problems can be solved relatively easily, there are many others that require the use of sophisticated techniques. We will strive to provide a foundation for NLP processing so that you will be able to understand better which techniques are available and applicable for a given problem.
NLP is a large and complex field. In this book, we will only be able to address a small part of it. We will focus on core NLP tasks that can be implemented using Java. Throughout this book, we will demonstrate a number of NLP techniques using both the Java SE SDK and other libraries, such as OpenNLP and Stanford NLP. To use these libraries, there are specific API JAR files that need to be associated with the project in which they are being used. A discussion of these libraries is found in the Survey of NLP tools section and contains download links to the libraries. The examples in this book were developed using NetBeans 8.0.2. These projects required the API JAR files to be added to the Libraries category of the Projects Properties dialog box.