Tokenization using OpenNLP
In this recipe, we will use the OpenNLP SimpleTokenizer class to illustrate tokenization. We will obtain the class's shared instance and call its tokenize method on a sample text.
Getting ready
To prepare, we need to do the following:
- Create a new Java project
- Add the following POM dependency to your project:
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.0</version>
</dependency>
How to do it...
Let's go through the following steps:
- Start by adding the following import statement to your project's class:
import opennlp.tools.tokenize.SimpleTokenizer;
- Next, add the following main method to the same class:
public static void main(String[] args) {
    String sampleText =
        "In addition, the rook was moved too far to be effective.";
    // SimpleTokenizer exposes a shared instance through its INSTANCE field
    SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
    String[] tokenList = simpleTokenizer.tokenize(sampleText);
    // Print one token per line
    for (String token : tokenList) {
        System.out.println(token);
    }
}
After executing the program, you should get the following output:
In
addition
,
the
rook
was
moved
too
far
to
be
effective
.
How it works...
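SimpleTokenizer splits text by character class: a new token starts wherever the character type changes among letters, digits, whitespace, and other characters such as punctuation. This is why punctuation is separated from the adjacent word even when no space lies between them. The class exposes a single shared instance through its static INSTANCE field. We pass a single string to this instance's tokenize method, which returns an array of strings, as shown in the following code: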
String sampleText =
    "In addition, the rook was moved too far to be effective.";
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
String[] tokenList = simpleTokenizer.tokenize(sampleText);
We then iterate through the array of tokens and display one per line. Note how the tokenizer treats the comma and the period as separate tokens, even though neither is preceded by a space.
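To make the splitting behavior more concrete, the following is a minimal sketch of character-class tokenization written with the Java standard library only. It is not OpenNLP's implementation: the class name CharClassTokenizerSketch, the CharClass enum, and the classify and tokenize helpers are all hypothetical names introduced for illustration, and the sketch simply starts a new token whenever the character class changes. The real SimpleTokenizer may differ in details, such as how it handles runs of consecutive punctuation characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; this is NOT the OpenNLP SimpleTokenizer source.
class CharClassTokenizerSketch {

    // Hypothetical character classes assumed for this sketch
    enum CharClass { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

    static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.ALPHABETIC;
        if (Character.isDigit(c)) return CharClass.NUMERIC;
        return CharClass.OTHER;
    }

    // Starts a new token whenever the character class changes;
    // whitespace runs separate tokens but are never emitted.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;
        CharClass current = CharClass.WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            CharClass next = classify(text.charAt(i));
            if (next != current) {
                if (current != CharClass.WHITESPACE) {
                    tokens.add(text.substring(start, i));
                }
                start = i;
                current = next;
            }
        }
        // Flush the final token if the text does not end in whitespace
        if (current != CharClass.WHITESPACE) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sampleText =
            "In addition, the rook was moved too far to be effective.";
        // Character-class splitting separates punctuation from words
        System.out.println(tokenize(sampleText));
        // Plain whitespace splitting keeps punctuation attached to words
        System.out.println(Arrays.asList(sampleText.split("\\s+")));
    }
}
```

Running the main method prints both token lists. In the whitespace-split list, the comma and the period stay attached to the preceding words ("addition," and "effective."), whereas the character-class sketch emits them as separate tokens, which matches the recipe's output for this sentence.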
See also
- The OpenNLP API documentation can be found at https://opennlp.apache.org/docs/1.9.0/apidocs/opennlp-tools/index.html