The tokenizer in an analyzer receives the output character stream from the character filters and splits it into a token stream, which becomes the input to the token filters (a sketch of how these stages chain together follows the list below). Elasticsearch supports three types of tokenizer, described as follows:
- Word-oriented tokenizer: This splits the character stream into individual words.
- Partial word tokenizer: This breaks the character stream into small fragments of a given length (for example, n-grams), which is useful for partial word matching.
- Structured text tokenizer: This splits the character stream into structured tokens such as keywords, email addresses, and zip codes.
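To see how a tokenizer fits between the other two stages, the following is a minimal sketch of a custom analyzer definition. The index name `my_index` and the analyzer name `my_custom_analyzer` are illustrative; `html_strip`, `standard`, and `lowercase` are built-in Elasticsearch components standing in for whichever character filter, tokenizer, and token filter you actually need:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Here, `html_strip` cleans the raw character stream, the `standard` tokenizer splits the result into word tokens, and `lowercase` operates on the token stream the tokenizer emits.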
We'll give an example of each built-in tokenizer and compile the results into the following tables. Let's first take a look at the word-oriented tokenizers:
Word-oriented tokenizer:

| Tokenizer | | |
| --- | --- | --- |
| standard | Input text | "POST https://api.iextrading.com/1.0/stock/acwf... |
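Rows like the one above can be reproduced with Elasticsearch's `_analyze` API, which runs a tokenizer directly against a piece of text. This is a minimal sketch; because the input string in the table is truncated here, the sample text below is an arbitrary stand-in:

```
POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

The response lists each token the tokenizer emits, together with its start and end character offsets and its position in the token stream.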