NLP tokenizer APIs
In this section, we will demonstrate several tokenization techniques using the OpenNLP, Stanford, and LingPipe APIs. Other APIs are available, but we restrict the demonstration to these three. The examples will give you an idea of what techniques are available.
We will use a string called paragraph to illustrate these techniques. The string includes a line break, of the kind that may occur in unexpected places in real text. It is defined here:
private String paragraph = "Let's pause, \nand then " + "reflect.";
Using the OpenNLP Tokenizer class
OpenNLP provides a Tokenizer interface that is implemented by three classes: SimpleTokenizer, TokenizerME, and WhitespaceTokenizer. This interface supports two methods:
tokenize: This is passed a string to tokenize and returns an array of tokens as strings.
tokenizePos: This is passed a string and returns an array of Span objects. The Span class is used to specify the beginning and ending offsets...
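To make the relationship between the two methods concrete, here is a minimal sketch in plain Java. It is not OpenNLP code: the TokenSpanSketch class and its nested Span class are hypothetical stand-ins, the tokenizer splits on whitespace only, and the convention that the end offset is exclusive is an assumption of this sketch. It only illustrates how an array of tokens and an array of offset pairs describe the same text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; this is not OpenNLP's implementation.
final class TokenSpanSketch {

    // Hypothetical Span: start is inclusive, end is exclusive (an assumption of this sketch).
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        // Recovers the token text from the original string using the offsets.
        String coveredText(String text) {
            return text.substring(start, end);
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + ")";
        }
    }

    // Returns the offsets of each whitespace-delimited token.
    static Span[] tokenizePos(String text) {
        List<Span> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    spans.add(new Span(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            spans.add(new Span(start, text.length()));
        }
        return spans.toArray(new Span[0]);
    }

    // Returns the tokens themselves, derived from the spans.
    static String[] tokenize(String text) {
        Span[] spans = tokenizePos(text);
        String[] tokens = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            tokens[i] = spans[i].coveredText(text);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String paragraph = "Let's pause, \nand then " + "reflect.";
        String[] tokens = tokenize(paragraph);
        Span[] spans = tokenizePos(paragraph);
        // Print each token next to the offsets it occupies in the string.
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(spans[i] + " " + tokens[i]);
        }
    }
}
```

Because this sketch splits on whitespace only, punctuation stays attached to the neighboring word (for example, "pause," is a single token). The offsets make it possible to map each token back to its exact position in the original string, including across the embedded line break.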