There are many tools available that support NLP. Some of these are available with the Java SE SDK but are limited in their utility for all but the simplest types of problems. Other libraries, such as Apache's OpenNLP and LingPipe, provide extensive and sophisticated support for NLP problems.
Low-level Java support includes the core string classes String, StringBuilder, and StringBuffer. These classes provide methods for searching, matching, and text replacement. Regular expressions use a special pattern syntax to match substrings. Java supports them through the java.util.regex package (the Pattern and Matcher classes) and through String methods such as matches, split, and replaceAll.
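As a rough illustration of this low-level support, the following sketch uses only standard JDK calls (String.indexOf, String.replace, and java.util.regex). The class name StringBasics, its helper methods, and the sample sentence are assumptions made for this example and do not come from any NLP library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
final class StringBasics {

    // Returns the index of the first occurrence of term, or -1 if it is absent.
    public static int find(String text, String term) {
        return text.indexOf(term);
    }

    // Replaces every literal occurrence of target with replacement.
    public static String replaceLiteral(String text, String target, String replacement) {
        return text.replace(target, replacement);
    }

    // Collects every substring of text that matches the regular expression.
    public static List<String> findMatches(String text, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith paid 25 dollars on 3 May.";
        System.out.println(find(text, "Smith"));                    // prints 4
        System.out.println(replaceLiteral(text, "dollars", "USD")); // prints Mr. Smith paid 25 USD on 3 May.
        System.out.println(findMatches(text, "\\d+"));              // prints [25, 3]
    }
}
```

These calls work on raw characters only. They have no notion of words, sentences, or parts of speech, which is why they suit only the simplest problems.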
As discussed earlier, tokenizers are used to split text into individual elements. Java provides support for tokenization with:
- The String class's split method
- The StreamTokenizer class
- The StringTokenizer class
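The three approaches listed above can be compared with a minimal sketch. The class TokenizerComparison and its method names are hypothetical. The StreamTokenizer variant relies on that class's default syntax table and, as a simplification chosen for this example, discards single-character punctuation tokens.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical comparison class, named for illustration only.
final class TokenizerComparison {

    // String.split with a regular expression matching runs of whitespace.
    public static List<String> withSplit(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // StringTokenizer with its default delimiters (whitespace characters).
    public static List<String> withStringTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // StreamTokenizer with its default syntax table: keeps words and numbers,
    // and skips ordinary single-character tokens such as commas.
    public static List<String> withStreamTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add(tokenizer.sval);
                } else if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add(String.valueOf(tokenizer.nval)); // numbers are parsed as doubles
                }
            }
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick, brown fox";
        System.out.println(withSplit(text));
        System.out.println(withStringTokenizer(text));
        System.out.println(withStreamTokenizer(text));
    }
}
```

For this input, split and StringTokenizer leave the comma attached to "quick,", while the StreamTokenizer sketch returns the bare word. By default, StreamTokenizer also treats quotes, slashes, periods, and hyphens specially, so its output on real prose can be surprising unless the syntax table is reconfigured.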
A number of NLP libraries and APIs also exist for Java. The following table gives a partial list of Java-based NLP APIs. Most of these are open source, although a number of commercial APIs are also available. We will focus on the open source APIs:
| API | URL |
|---|---|
| Apertium | |
| General Architecture for Text Engineering | |
| Learning Based Java | |
| LingPipe | |
| MALLET | |
| MontyLingua | |
| Apache OpenNLP | |
| UIMA | |
| Stanford Parser | |
| Apache Lucene Core | |
| Snowball | |
Many NLP tasks are combined to form a pipeline. A pipeline is a series of steps, each performing one NLP task, in which the output of one step feeds the next in order to achieve a processing goal. Examples of frameworks that support pipelines are General Architecture for Text Engineering (GATE) and Apache UIMA.
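The idea of chaining tasks can be sketched in plain Java with java.util.function.Function. This is not the GATE or UIMA API. The class SimplePipeline, its three stages, and the tiny stop-word list are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical pipeline class, named for illustration only.
final class SimplePipeline {

    // Illustrative stop-word list; real lists are much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "on", "of"));

    // Stage 1: split raw text into whitespace-delimited tokens.
    public static List<String> tokenize(String text) {
        return new ArrayList<>(Arrays.asList(text.trim().split("\\s+")));
    }

    // Stage 2: lowercase each token and strip leading or trailing punctuation.
    public static List<String> normalize(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String cleaned = token.toLowerCase(Locale.ROOT)
                    .replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }

    // Stage 3: drop tokens that appear in the stop-word list.
    public static List<String> removeStopWords(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    // Composes the stages so that each stage consumes the previous stage's output.
    public static List<String> run(String text) {
        Function<String, List<String>> stage1 = SimplePipeline::tokenize;
        Function<List<String>, List<String>> stage2 = SimplePipeline::normalize;
        Function<List<String>, List<String>> stage3 = SimplePipeline::removeStopWords;
        return stage1.andThen(stage2).andThen(stage3).apply(text);
    }

    public static void main(String[] args) {
        System.out.println(run("The cat sat on the mat.")); // prints [cat, sat, mat]
    }
}
```

Frameworks such as GATE and UIMA apply the same principle at a larger scale, adding shared document representations and configurable components that this sketch does not attempt to model.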
In the next section, we will cover several NLP APIs in more depth. A brief overview of their capabilities will be presented along with a list of useful links for each API.