What is tokenization?
Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Text is split into tokens based on a set of delimiters, which are frequently whitespace characters. Whitespace in Java is defined by the Character class's isWhitespace method; the characters it recognizes are listed in the following table. At times, however, a different set of delimiters is needed. For example, when whitespace delimiters obscure text breaks such as paragraph boundaries, and detecting those breaks is important, a different set of delimiters can be useful.
Character | Meaning |
---|---|
Unicode space character | (space_separator, line_separator, or paragraph_separator) |
\t | U+0009 horizontal tabulation |
\n | U+000A line feed |
\u000B | U+000B vertical tabulation |
\f | U+000C form feed |
\r | U+000D carriage return |
\u001C | U+001C file separator |
\u001D | U+001D group separator |
\u001E | U+001E record separator |
\u001F | U+001F... |
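
As a rough illustration of the two ideas above, the following sketch first splits text wherever Character.isWhitespace returns true, and then splits the same kind of text on blank lines so that paragraph boundaries survive. The class name WhitespaceTokenizer and its two methods are hypothetical names chosen for this example, and treating a run of two or more line breaks as a paragraph boundary is an assumption rather than a fixed rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, written only for illustration.
final class WhitespaceTokenizer {

    // Splits text into tokens, treating every code point for which
    // Character.isWhitespace returns true as a delimiter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isWhitespace(codePoint)) {
                // A delimiter ends the token being built, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Splits text into paragraphs instead of words. Assumption: a paragraph
    // boundary is a run of two or more consecutive line breaks.
    static List<String> paragraphs(String text) {
        List<String> result = new ArrayList<>();
        for (String paragraph : text.split("(\\r?\\n){2,}")) {
            if (!paragraph.isEmpty()) {
                result.add(paragraph);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The tab, the unit separator, and the line feed all act as delimiters.
        System.out.println(tokenize("Let's pause,\tand then\u001Freflect.\n"));
        // Single line breaks stay inside a paragraph; blank lines separate them.
        System.out.println(paragraphs("First paragraph\nstill first.\n\nSecond paragraph."));
    }
}
```

Note that whitespace tokenization would discard the blank line in the second input, which is why a separate delimiter is used when paragraph boundaries matter. Character.isWhitespace also does not treat non-breaking spaces as whitespace, so a token containing one is not split at that point.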