Using an analyzer
In this recipe, we are going to learn how to set up and use a specific analyzer for text analysis. Indexing data in Elasticsearch, especially for search use cases, requires that you define how text should be processed before it is indexed; this is what analyzers do.
Analyzers in Elasticsearch perform tokenization and normalization. Elasticsearch offers a variety of ready-made analyzers for common scenarios, as well as language-specific analyzers for English, German, Spanish, French, Hindi, and so on.
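To make tokenization and normalization concrete, you can ask Elasticsearch to analyze a sample sentence through the _analyze API. The following is a minimal sketch, assuming a connected Elasticsearch Python client named es (client setup is shown later in this recipe):
# Analyze a sample sentence with the built-in standard analyzer:
# it splits the text on word boundaries and lowercases each token.
response = es.indices.analyze(
    analyzer="standard",
    text="The QUICK Brown Fox!"
)
# Each token in the response also carries its position and character offsets.
print([token["token"] for token in response["tokens"]])
# Expected output: ['the', 'quick', 'brown', 'fox']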
In this recipe, we will see how to configure the standard analyzer with the English stopwords filter.
Getting ready
Make sure that you completed the Adding data from the Elasticsearch client recipe. Also, make sure to download the following sample Python script from the GitHub repository: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_analyzer.py.
The command snippets of this recipe are available at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#using-analyzer.
How to do it…
In this recipe, you will learn how to configure your Python code to interface with an Elasticsearch cluster, define a custom English text analyzer, create a new index with the analyzer, and verify that the index uses the specified settings.
Let’s look at the provided Python script:
- At the beginning of the script, we create an instance of the Elasticsearch client:
es = Elasticsearch(
    cloud_id=ES_CID,
    basic_auth=(ES_USER, ES_PWD)
)
- To ensure that we do not use an existing movies index, the script includes code that deletes any such index:
if es.indices.exists(index="movies"):
    print("Deleting existing movies index...")
    es.options(ignore_status=[404, 400]).indices.delete(index="movies")
- Next, we define the analyzer configuration:
index_settings = {
    "analysis": {
        "analyzer": {
            "standard_with_english_stopwords": {
                "type": "standard",
                "stopwords": "_english_"
            }
        }
    }
}
- We then create the index with settings that define the analyzer:
es.indices.create(index='movies', settings=index_settings)
- Finally, to verify the successful addition of the analyzer, we retrieve the settings:
settings = es.indices.get_settings(index='movies')
analyzer_settings = settings['movies']['settings']['index']['analysis']
print(f"Analyzer used for the index: {analyzer_settings}")
- After reviewing the script, execute it with the following command, and you should see the output shown in Figure 2.10:
$ python sampledata_analyzer.py
Figure 2.10 – The output of the sampledata_analyzer.py script
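For reference, a successful run should print something along these lines (the deletion message appears only if a movies index already existed):
$ python sampledata_analyzer.py
Deleting existing movies index...
Analyzer used for the index: {'analyzer': {'standard_with_english_stopwords': {'type': 'standard', 'stopwords': '_english_'}}}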
Alternatively, you can go to Kibana | Dev Tools and issue the following request:
GET /movies/_settings
In the response, you should see the settings currently applied to the movies index with the configured analyzer, as shown in Figure 2.11:
Figure 2.11 – The analyzer configuration in the index settings
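For reference, an abridged version of the response should include the analysis block defined by the script (other index settings, such as the creation date and UUID, are omitted here):
{
  "movies": {
    "settings": {
      "index": {
        "analysis": {
          "analyzer": {
            "standard_with_english_stopwords": {
              "stopwords": "_english_",
              "type": "standard"
            }
          }
        }
      }
    }
  }
}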
How it works...
The settings block of the index configuration is where the analyzer is set. As we are modifying the built-in standard analyzer in our recipe, we give it a unique name (standard_with_english_stopwords) and set its type to standard. Text indexed from this point on will be analyzed by the modified analyzer. To test this, we can use the _analyze endpoint on the index:
POST movies/_analyze
{
  "text": "A young couple decides to elope.",
  "analyzer": "standard_with_english_stopwords"
}
It should yield the results shown in Figure 2.12:
Figure 2.12 – The result of analyzing text with the stopword analyzer
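If you prefer to stay in Python, the same check can be run with the client's analyze method; this is a small sketch assuming the es client and the movies index created by the script. With the _english_ stopwords in place, common words such as "a" and "to" are dropped from the token stream:
# Run the same analysis through the Python client.
response = es.indices.analyze(
    index="movies",
    analyzer="standard_with_english_stopwords",
    text="A young couple decides to elope."
)
print([token["token"] for token in response["tokens"]])
# Expected output: ['young', 'couple', 'decides', 'elope']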
There’s more…
While Elasticsearch offers many built-in analyzers for different languages and text types, you can also define custom analyzers. These allow you to specify how text is broken down and modified for indexing or searching, using components such as tokenizers, token filters, and character filters – either those provided by Elasticsearch or custom ones you create. For example, you can design an analyzer that converts text to lowercase, removes common words, substitutes synonyms, and strips accents.
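As an illustration, the following sketch defines such an analyzer with the Python client; the analyzer name, filter name, index name, and synonym list are made up for this example and would be adapted to your data:
# Settings for a hypothetical custom analyzer that lowercases text,
# removes English stopwords, applies synonyms, and strips accents.
custom_settings = {
    "analysis": {
        "filter": {
            "movie_synonyms": {
                "type": "synonym",
                "synonyms": ["film, movie"]
            }
        },
        "analyzer": {
            "folded_english": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",       # normalize case
                    "stop",            # remove English stopwords (default list)
                    "movie_synonyms",  # expand synonyms
                    "asciifolding"     # strip accents, e.g., 'café' -> 'cafe'
                ]
            }
        }
    }
}
es.indices.create(index='movies_custom', settings=custom_settings)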
Reasons for needing a custom analyzer may include the following:
- Handling various languages and scripts that require special processing, such as Chinese, Japanese, and Arabic
- Enhancing the relevance and comprehensiveness of search results using synonyms, stemming, lemmatization, and so on
- Normalizing text by removing punctuation, whitespace, and accents, and by making matching case-insensitive