Document analysis
When we index documents into Elasticsearch, their text goes through an analysis phase, which is needed to build the inverted index. Analysis is a series of steps performed by Lucene, depicted in the following image:
The analysis phase is performed by analyzers, each composed of zero or more character filters, exactly one tokenizer, and zero or more token filters. You can declare a separate analyzer for each field in your document, depending on the need. For a given field, the same analyzer can be used for both indexing and searching, or a different one can be used for each.
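As a rough illustration of how such an analyzer might be declared, the following sketch builds an index-creation request body as a plain Python dictionary. The analyzer name `html_text_analyzer` and the field names `body` and `title` are hypothetical, and the layout assumes a recent Elasticsearch version with typeless mappings. The `html_strip` character filter, `whitespace` tokenizer, `lowercase` token filter, and the `analyzer` and `search_analyzer` mapping parameters are standard Elasticsearch names.

```python
import json

# Request body for creating an index. The analyzer and field names are
# hypothetical and chosen only for illustration.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "html_text_analyzer": {
                    "type": "custom",
                    # Character filters run first, on the raw text.
                    "char_filter": ["html_strip"],
                    # Exactly one tokenizer splits the text into tokens.
                    "tokenizer": "whitespace",
                    # Token filters then run on each token, in order.
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Different analyzers at index time and at search time.
            "body": {
                "type": "text",
                "analyzer": "html_text_analyzer",
                "search_analyzer": "standard",
            },
            # Same analyzer for both indexing and searching.
            "title": {"type": "text", "analyzer": "standard"},
        }
    },
}

# Print the JSON that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```

When `search_analyzer` is omitted, as for `title` here, the analyzer given in `analyzer` is used at both index and search time.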
- Character Filters: Character filters clean up the raw text before it is tokenized, for example by stripping out HTML tags.
- Tokenizers: The next step is to split the text into terms, called tokens. This is done by a tokenizer, which can split on a rule such as whitespace. More details about tokenizers can be found at this URL: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html.
- Token...