An analyzer is a package that contains three building blocks: character filters, tokenizers, and token filters. You can create a custom analyzer by combining these or other building blocks to get the functionality you need. Here is what each building block does:
- Character filters receive the original text as a stream of characters. They can transform the stream by adding, removing, or changing characters. For example, a character filter can change the & character to the word and. An analyzer may have zero or more character filters, and they are always applied in order.
- Tokenizers receive the stream of characters and break it down into tokens, so the output is a stream of tokens. For example, a whitespace tokenizer splits the text wherever it finds whitespace: Hello World! becomes [Hello, World!] (case and punctuation are left untouched). It also records the order of the...
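
To make the flow concrete, here is a minimal sketch in plain Python of a character filter feeding a whitespace tokenizer. It is not Elasticsearch code: the names `Token`, `ampersand_filter`, `whitespace_tokenizer`, and `analyze` are hypothetical, and the offsets refer to the filtered text, which is a simplification made for this illustration.

```python
import re
from typing import Callable, List, NamedTuple


class Token(NamedTuple):
    text: str
    position: int  # order of the token in the stream
    start: int     # start character offset in the filtered text
    end: int       # end character offset (exclusive)


def ampersand_filter(text: str) -> str:
    """Character filter (illustrative): replace '&' with the word 'and'."""
    return text.replace("&", "and")


def whitespace_tokenizer(text: str) -> List[Token]:
    """Tokenizer (illustrative): split on runs of whitespace.

    Case and punctuation are kept exactly as they appear in the input.
    """
    return [
        Token(match.group(), position, match.start(), match.end())
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[Token]],
) -> List[Token]:
    """Apply the character filters in order, then tokenize the result."""
    for char_filter in char_filters:
        text = char_filter(text)
    return tokenizer(text)


tokens = analyze("Hello & World!", [ampersand_filter], whitespace_tokenizer)
print([t.text for t in tokens])  # ['Hello', 'and', 'World!']
```

Passing an empty list of character filters leaves the text unchanged before tokenizing, which mirrors an analyzer configured with no character filters.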