Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Tokens are split based on a set of delimiters, which are frequently whitespace characters. Whitespace in Java is defined by the Character class's isWhitespace method, and the characters it recognizes are listed in the following table. At times, however, a different set of delimiters is needed. For example, when whitespace delimiters obscure text breaks such as paragraph boundaries, and detecting those breaks is important, a different set of delimiters can be useful.
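As a minimal sketch of this idea, the following Java code splits text wherever Character.isWhitespace returns true, and separately splits on blank lines so that paragraph boundaries stay visible. The class and method names are hypothetical, and the blank-line pattern is only one assumed choice of alternative delimiter.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceTokenizerSketch {

    // Hypothetical helper: collects runs of non-whitespace characters,
    // treating every character accepted by Character.isWhitespace as a delimiter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isWhitespace(ch)) {
                // A delimiter ends the current token, if one is in progress.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(ch);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Hypothetical helper: uses two or more consecutive line breaks as the
    // delimiter (an assumption), so paragraph boundaries are not lost.
    static String[] paragraphs(String text) {
        return text.split("(?:\\r?\\n){2,}");
    }

    public static void main(String[] args) {
        String text = "Let's pause,\tand then\n\nreflect.";
        System.out.println(tokenize(text));
        for (String paragraph : paragraphs(text)) {
            System.out.println("Paragraph: " + paragraph);
        }
    }
}
```

With whitespace as the delimiter, the tab, the space, and the blank line are all treated alike, so the paragraph break disappears from the token list. Splitting on blank lines first keeps that break, and each paragraph can then be tokenized on its own.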
Character | Meaning |
Unicode space character | (space_separator, line_separator, or paragraph_separator) |
\t | U+0009 horizontal tabulation |
\n | U+000A line feed |
\u000B | U+000B vertical tabulation |
\f | U+000C form feed |
\r | U+000D... |