Using an analyzer
In this recipe, we are going to learn how to set up and use a specific analyzer for text analysis. Indexing data in Elasticsearch, especially for search use cases, requires that you define how text should be processed before it is indexed; this is what analyzers do.
Analyzers in Elasticsearch perform tokenization and normalization. Elasticsearch offers a variety of ready-made analyzers for common scenarios, as well as language-specific analyzers for English, German, Spanish, French, Hindi, and so on.
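To make tokenization and normalization concrete, you can ask Elasticsearch to analyze a sample sentence through the _analyze API. The following is a minimal sketch, assuming a connected Elasticsearch Python client named es (client setup is shown later in this recipe):
# Analyze a sample sentence with the built-in standard analyzer:
# it splits the text on word boundaries and lowercases each token.
response = es.indices.analyze(
    analyzer="standard",
    text="The QUICK Brown Fox!"
)
# Each token in the response also carries its position and character offsets.
print([token["token"] for token in response["tokens"]])
# Expected output: ['the', 'quick', 'brown', 'fox']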
In this recipe, we will see how to configure the standard analyzer with the English stopwords filter.
Getting ready
Make sure that you completed the Adding data from the Elasticsearch client recipe. Also, make sure to download the following sample Python script from the GitHub repository: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_analyzer.py.
The command snippets of this recipe are available at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#using-analyzer.
How to do it…
In this recipe, you will learn how to configure your Python code to interface with an Elasticsearch cluster, define a custom English text analyzer, create a new index with the analyzer, and verify that the index uses the specified settings.
Let’s look at the provided Python script:
- At the beginning of the script, we create an instance of the Elasticsearch client:
es = Elasticsearch(
    cloud_id=ES_CID,
    basic_auth=(ES_USER, ES_PWD)
)
- To ensure that we do not use an existing movies index, the script includes code that deletes any such index:
if es.indices.exists(index="movies"):
    print("Deleting existing movies index...")
    es.options(ignore_status=[404, 400]).indices.delete(index="movies")
- Next, we define the analyzer configuration:
index_settings = {
    "analysis": {
        "analyzer": {
            "standard_with_english_stopwords": {
                "type": "standard",
                "stopwords": "_english_"
            }
        }
    }
}
- We then create the index with settings that define the analyzer:
es.indices.create(index='movies', settings=index_settings)
- Finally, to verify the successful addition of the analyzer, we retrieve the settings:
settings = es.indices.get_settings(index='movies')
analyzer_settings = settings['movies']['settings']['index']['analysis']
print(f"Analyzer used for the index: {analyzer_settings}")
- After reviewing the script, execute it with the following command, and you should see the output shown in Figure 2.10:
$ python sampledata_analyzer.py
Figure 2.10 – The output of the sampledata_analyzer.py script
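For reference, a successful run should print something along these lines (the deletion message appears only if a movies index already existed):
$ python sampledata_analyzer.py
Deleting existing movies index...
Analyzer used for the index: {'analyzer': {'standard_with_english_stopwords': {'type': 'standard', 'stopwords': '_english_'}}}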
Alternatively, you can go to Kibana | Dev Tools and issue the following request:
GET /movies/_settings
In the response, you should see the settings currently applied to the movies index with the configured analyzer, as shown in Figure 2.11:
Figure 2.11 – The analyzer configuration in the index settings
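For reference, an abridged version of the response should include the analysis block defined by the script (other index settings, such as the creation date and UUID, are omitted here):
{
  "movies": {
    "settings": {
      "index": {
        "analysis": {
          "analyzer": {
            "standard_with_english_stopwords": {
              "stopwords": "_english_",
              "type": "standard"
            }
          }
        }
      }
    }
  }
}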
How it works...
The settings block of the index configuration is where the analyzer is set. As we are modifying the built-in standard analyzer in our recipe, we give it a unique name (standard_with_english_stopwords) and set its type to standard. Text indexed from this point on will be analyzed by the modified analyzer. To test this, we can use the _analyze endpoint on the index:
POST movies/_analyze
{
  "text": "A young couple decides to elope.",
  "analyzer": "standard_with_english_stopwords"
}
It should yield the results shown in Figure 2.12:
Figure 2.12 – The result of analyzing text with the stopword analyzer
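If you prefer to stay in Python, the same check can be run with the client's analyze method; this is a small sketch assuming the es client and the movies index created by the script. With the _english_ stopwords in place, common words such as "a" and "to" are dropped from the token stream:
# Run the same analysis through the Python client.
response = es.indices.analyze(
    index="movies",
    analyzer="standard_with_english_stopwords",
    text="A young couple decides to elope."
)
print([token["token"] for token in response["tokens"]])
# Expected output: ['young', 'couple', 'decides', 'elope']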
There’s more…
While Elasticsearch offers many built-in analyzers for different languages and text types, you can also define custom analyzers. These allow you to specify how text is broken down and modified for indexing or searching, using components such as tokenizers, token filters, and character filters – either those provided by Elasticsearch or custom ones you create. For example, you can design an analyzer that converts text to lowercase, removes common words, substitutes synonyms, and strips accents.
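As an illustration, the following sketch defines such an analyzer with the Python client; the analyzer name, filter name, index name, and synonym list are made up for this example and would be adapted to your data:
# Settings for a hypothetical custom analyzer that lowercases text,
# removes English stopwords, applies synonyms, and strips accents.
custom_settings = {
    "analysis": {
        "filter": {
            "movie_synonyms": {
                "type": "synonym",
                "synonyms": ["film, movie"]
            }
        },
        "analyzer": {
            "folded_english": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",       # normalize case
                    "stop",            # remove English stopwords (default list)
                    "movie_synonyms",  # expand synonyms
                    "asciifolding"     # strip accents, e.g., 'café' -> 'cafe'
                ]
            }
        }
    }
}
es.indices.create(index='movies_custom', settings=custom_settings)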
Reasons for needing a custom analyzer may include the following:
- Handling various languages and scripts that require special processing, such as Chinese, Japanese, and Arabic
- Enhancing the relevance and comprehensiveness of search results using synonyms, stemming, lemmatization, and so on
- Normalizing text by removing punctuation, whitespace, and accents, and by making matching case-insensitive