Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Elastic Stack 8.x Cookbook
Elastic Stack 8.x Cookbook

Elastic Stack 8.x Cookbook: Over 80 recipes to perform ingestion, search, visualization, and monitoring for actionable insights

Arrow left icon
Profile Icon Huage Chen Profile Icon Yazid Akadiri
Arrow right icon
$9.99 $41.99
Full star icon Full star icon Full star icon Full star icon Full star icon 5 (3 Ratings)
eBook Jun 2024 688 pages 1st Edition
eBook
$9.99 $41.99
Paperback
$51.99
Subscription
Free Trial
Renews at $19.99p/m
Arrow left icon
Profile Icon Huage Chen Profile Icon Yazid Akadiri
Arrow right icon
$9.99 $41.99
Full star icon Full star icon Full star icon Full star icon Full star icon 5 (3 Ratings)
eBook Jun 2024 688 pages 1st Edition
eBook
$9.99 $41.99
Paperback
$51.99
Subscription
Free Trial
Renews at $19.99p/m
eBook
$9.99 $41.99
Paperback
$51.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Table of content icon View table of contents Preview book icon Preview Book

Elastic Stack 8.x Cookbook

Ingesting General Content Data

This chapter, along with Chapter 4, will focus on data ingestion. Generally, we can categorize data into two groups – general content (data from APIs, HTML pages, catalogs, data from Relational Database Management System (RDBMS), PDFs, spreadsheets, etc.), and time series (data indexed in chronological order, such as logs, metrics, traces, and security events). In this chapter, we will ingest general content to illustrate the basic concepts of data ingestion, including fundamental data operations (index, delete, and update), analyzers, static and dynamic index mappings, and index templates.

Figure 2.1 illustrates the connections between various components, and in this chapter, we will explore recipes dedicated to the Client APP, Analyzer, Mapping, and Index template components (you can view the color image when you download the free PDF version of this book):

Figure 2.1 – Elasticsearch index management components

Figure 2.1 – Elasticsearch index management components

In this chapter, we are going to cover the following main topics:

  • Adding data from the Elasticsearch client
  • Updating data in Elasticsearch
  • Deleting data in Elasticsearch
  • Using an analyzer
  • Defining index mapping
  • Using dynamic templates in document mapping
  • Creating an index template
  • Indexing multiple documents using Bulk API

Introducing the Wikipedia Movie Plots dataset

For the general content sample data that we will use in this chapter, we will use the Wikipedia Movie Plots dataset from kaggle.com, authored by JustinR. (https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots). The dataset contains interesting metadata of more than 34,000 movies scraped from Wikipedia.

Dataset citation

Wikipedia Movie Plots. (2018, October 15). Kaggle: https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots.

Note that this dataset is under the CC BY-SA 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/).

Technical requirements

To follow the different recipes in this chapter, you will need an Elastic Stack deployment that includes the following:

  • Elasticsearch to search and store data
  • Kibana for data visualization and Dev Tools access

In addition to the Elastic Stack deployment, you’ll also need to have Python 3+ installed on your local machine.

Adding data from the Elasticsearch client

To ingest general content such as catalogs, HTML pages, and files from your application, Elastic provides a wide range of Elastic language clients to easily ingest data via Elasticsearch REST APIs. In this recipe, we will learn how to add sample data to Elasticsearch hosted on Elastic Cloud using a Python client.

To use Elasticsearch’s REST APIs through various programming languages, a client application chooses a suitable client library. The client initializes and sends HTTP requests, directing them to the Elasticsearch cluster for data operations. Elasticsearch processes the requests and returns HTTP responses containing results or errors. The client application parses these responses and acts on the data accordingly. Figure 2.2 shows the summarized data flow:

Figure 2.2 – Elasticsearch’s client request and response flow

Figure 2.2 – Elasticsearch’s client request and response flow

Getting ready

To simplify the package management, we recommend you install pip(https://pypi.org/project/pip/).

The snippets of this recipe are available here: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#adding-data-from-the-elasticsearch-client.

How to do it…

First, we will install the Elasticsearch Python client:

  1. Add elasticsearch, elasticsearch-async, and load_dotenv to the requirements.txt file of your Python project (the sample requirements.txt file can be found at this address: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/requirements.txt).
  2. Run the following command to install the Elasticsearch Python client library:
    $ pip install -r requirements.txt

    Now, let’s set up a connection to Elasticsearch.

  3. Prepare a .env file to store the access information, Cloud ID("ES_CID"), user name("ES_USER"), and password("ES_PWD"), for the basic authentication. You can find the sample .env file at this address: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/.env.

    Remember that we saved the password for our default user, elastic, in the Deploying Elastic Stack on Elastic Cloud recipe in Chapter 1, and the instructions to find the cloud ID can be found in the same recipe.

  4. Import the libraries in a Python file (sampledata_index.py), which we will use for this recipe:
    import os
    from elasticsearch import Elasticsearch
    from dotenv import load_dotenv
  5. Load the environment variables and initiate an Elasticsearch connection:
    load_dotenv()
    ES_CID = os.getenv('ES_CID')
    ES_USER = os.getenv('ES_USER')
    ES_PWD = os.getenv('ES_PWD')
    es = Elasticsearch(
        cloud_id=ES_CID,
        basic_auth=(ES_USER, ES_PWD)
    )
    print(es.info())
  6. Now, you can run the script to check whether the connection is successful. Run the following command:
    $ python sampledata_index.py

    You should see an output that looks like the following screenshot:

Figure 2.3 – Connected Elasticsearch information

Figure 2.3 – Connected Elasticsearch information

  1. We can now extend the script to ingest a document. Prepare a sample JSON document from the movie dataset:
    mymovie = {
        'release_year': '1908',
        'title': 'It is not this day.',
        'origin': 'American',
        'director': 'D.W. Griffith',
        'cast': 'Harry Solter, Linda Arvidson',
        'genre': 'comedy',
        'wiki_page':'https://en.wikipedia.org/wiki/A_Calamitous_Elopement',
        'plot': 'A young couple decides to elope after being caught in the midst of a romantic moment by the woman.'
    }
  2. Index the sample data in Elasticsearch. Here, we will choose the index name 'movies' and print the index results. Finally, we will store the document ID in a tmp file that we will reuse for the following recipes:
    response = es.index(index='movies', document=mymovie)
    print(response)
    # Write the '_id' to a file named tmp.txt
    with open('tmp.txt', 'w') as file:
        file.write(response['_id'])
    # Print the contents of the file to confirm it's written correctly
    with open('tmp.txt', 'r') as file:
        print(f"document id saved to tmp.txt: {file.read()}")
    time.sleep(2)
  3. Verify the data in Elasticsearch to ensure that it has been successfully indexed; wait two seconds after the indexing, query Elasticsearch using the _search API, and then print the results:
    response = es.search(index='movies', query={"match_all": {}})
    print("Sample movie data in Elasticsearch:")
    for hit in response['hits']['hits']:
    print(hit['_source'])
  4. Execute the script again with the following script:
    $ python sampledata_index.py

    You should have the following result in the console output:

Figure 2.4 – The output of the sampledata_index.py script

Figure 2.4 – The output of the sampledata_index.py script

The full code sample can be found at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_index.py.

How it works...

In this recipe, we learned how to use the Elastic Python client to securely connect to a hosted deployment on Elastic Cloud.

Elasticsearch created the movies index by default during the first ingestion, and the fields were created with default mapping.

Later in this chapter, we will learn how to define static and dynamic mapping to customize field types with the help of concrete recipes.

It’s also important to note that as we did not provide a document ID, Elasticsearch automatically generated an ID during the indexing phase as well.

The following diagram (Figure 2.5) shows the index processing flow:

Figure 2.5 – The ingestion flow

Figure 2.5 – The ingestion flow

There’s more…

In this recipe, we used the HTTP basic authentication method. The Elastic Python client provides authentication methods such as HTTP Bearer authentication and API key authentication. Detailed documentation can be found at the following link: https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#auth-bearer.

We chose to illustrate the simplicity of general content data ingestion by using the Python client. Detailed documentation for other client libraries can be found at the following link: https://www.elastic.co/guide/en/elasticsearch/client/index.html

During the development and testing phase, it’s also very useful to use the Elastic REST API and test either with an HTTP client, such as CURL/Postman, or with the Kibana Dev Tools console (https://www.elastic.co/guide/en/kibana/current/console-kibana.html).

Updating data in Elasticsearch

In this recipe, we will explore how to update data in Elasticsearch using the Python client.

Getting ready

Ensure that you have installed the Elasticsearch Python client and have successfully set up a connection to your Elasticsearch cluster (refer to the Adding data from the Elasticsearch client recipe). You will also need to have completed the previous recipe, which involves ingesting a document into the movies index.

Note

The following three recipes will use the same set of requirements.

How to do it…

In this recipe, we’re going to update the director field of a particular document within the movies index. The director field will be changed from its current value, D.W. Griffith, to a new value, Clint Eastwood. The following are the steps you’ll need to follow in your Python script to perform this update and confirm that it has been successfully applied. Let’s inspect the Python script that we will use to update the ingested document (https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_update.py):

  1. First, we need to retrieve the document ID of the previously ingested document from the tmp.txt file, which we intend to update. The field to update here is director; we are going to update the value from D.W. Griffith to Clint Eastwood:
    index_name = 'movies'
    document_id = ''
    # Read the document_id the ingested document of the previous recipe
    with open('tmp.txt', 'r') as file:
        document_id = file.read()
    document = {
        'director': 'Clint Eastwood'
    }
  2. We can now check document_id, verify that the document exists in the index, and then perform the update operation:
    # Update the document in Elasticsearch if document_id is valid
    if document_id != '':
        if es.exists(index=index_name, id=document_id):
            response = es.update(index=index_name, id=document_id,
                                 doc=document)
            print(f"Update status: {response['result']}")
  3. Once the document is updated, to verify that the update is successful, you can retrieve the updated document from Elasticsearch and print the modified fields:
    updated_document = es.get(index=index_name, id=document_id)
    print("Updated document:")
    print(updated_document)
  4. After inspecting the script, let’s run it with the following command:
    $ python sampledata_update.py
Figure 2.6 – The output of the sampledata_update.py script

Figure 2.6 – The output of the sampledata_update.py script

You should see that the _version and director fields are updated.

How it works...

Each document includes a _version field in Elasticsearch. Elasticsearch documents cannot be modified directly, as they are immutable. When you update an existing document, a new document is generated with an incremented version, while the previous document is flagged for deletion.

There’s more…

We have just seen how to update a single document in Elasticsearch; in general, this is not optimal from a performance point of view. To update multiple documents that match a specific query, you can use the Update By Query API. This allows you to define a query to select the documents you want to update and specify the changes to be made; here is an example of how to do it via Elasticsearch’s REST API:

q = {
    "script": {
        "source": "ctx._source.genre = 'comedies'",
        "lang": "painless"
    },
    "query": {
        "bool": {
            "must": [
              {
                "term": {
                    "genre": "comedy"
                }
              }
            ]
        }
    }
}
es.update_by_query(body=q, index=index_name)

The full Python script is available here: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_update_by_query.py.

Note

The script used here is based on a painless script; we will see more examples in Chapter 6.

The other way to update multiple documents in a single request is via Elasticsearch’s Bulk API. The Bulk API can be used to insert, update, and delete multiple documents efficiently. We will learn how to use the Bulk API to ingest multiple documents at the end of this chapter. For more detailed information, refer to the following documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html.

To retrieve the ID of the document we want to update, we rely on a tmp.txt file where the ID of a previously created document was saved. Alternatively, you can retrieve the document’s ID by using the Dev Tools in Kibana, perform a search on the movies index, go to Kibana | Dev Tools, and execute the following command:

GET movies/_search

This query should return a list of hits that display all documents in the index, along with their respective IDs, as shown in Figure 2.7. Using these results, locate and record the ID of the document you would like to update:

Figure 2.7 – Checking the document ID

Figure 2.7 – Checking the document ID

Deleting data in Elasticsearch

In this recipe, we will explore how to delete a document from an Elasticsearch index.

Getting ready

Refer to the requirements for the Updating data in Elasticsearch recipe.

Make sure to download the following Python script from the GitHub repository: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_delete.py.

The snippets of the recipe are available at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#deleting-data-in-elasticsearch.

How to do it…

  1. First, let us inspect the sampledata_delete.py Python script. Like the process in the previous recipe, we need to retrieve document_id from the tmp.txt file:
    with open('tmp.txt', 'r') as file:
              document_id = file.read()
  2. We can now check document_id, verify that the document exists in the index, and then perform the delete operation by using the previously obtained document_id:
    if document_id != '':
        if es.exists(index=index_name, id=document_id):
            # delete the document in Elasticsearch
            response = es.delete(index=index_name, id=document_id)
            print(f"delete status: {response['result']}")
  3. After reviewing the delete script, execute it with the following command:
    $ python sampledata_delete.py

    You should see the following output:

Figure 2.8 –  The output of the sampledata_delete.py script

Figure 2.8 – The output of the sampledata_delete.py script

  1. For further verification, return to the Dev Tools in Kibana and execute the search request again on the movies index:
    GET movies/_search

    This time, the result should reflect the deletion:

Figure 2.9 – The search results in the movies index after deletion

Figure 2.9 – The search results in the movies index after deletion

The total hits will now be 0, confirming that the document has been successfully deleted.

How it works...

When a document is deleted in Elasticsearch, it is not immediately removed from the index. Instead, Elasticsearch marks the document as deleted. These documents remain in the index until a merging process occurs during routine optimization tasks, when Elasticsearch physically expunges the deleted documents from the index.

This mechanism allows Elasticsearch to handle deletions efficiently. By marking documents as deleted rather than expunging them outright, Elasticsearch avoids costly segment reorganizations within the index. The removal occurs during optimized, controlled background tasks.

There’s more…

While we have discussed deleting documents by document_id, this might not be the most efficient approach for deleting multiple documents. For such scenarios, the Delete By Query API is more suitable, such as the following:

Note

Before executing the upcoming query, it is necessary to re-index the document, since it was deleted earlier in the recipe. Ensure that you have re-added the document to the movies index by executing the sampledata_index.py Python script.

POST /movies/_delete_by_query
{
  "query": {
    "match": {
      "genre": "comedy"
    }
  }
}

The preceding query will delete all movies matching the comedy genre in our index.

Also, when deleting many documents, the best practice is to use the Delete By Query with the slices parameter to improve performance. The Delete by Query feature with the slices parameter in Elasticsearch offers considerable advantages, especially when dealing with the deletion of numerous documents. This best practice enhances performance by splitting a large deletion task into smaller, parallel operations. This method not only boosts the efficiency and reliability of the deletion process but also lessens the burden on the cluster. By dividing the task, you ensure a more balanced and effective approach to managing large-scale deletions in Elasticsearch.

See also

For more details on the Delete By Query feature, refer to the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html.

Using an analyzer

In this recipe, we are going to learn how to set up and use a specific analyzer for text analysis. Indexing data in Elasticsearch, especially for search use cases, requires that you define how text should be processed before indexation; this is what analyzers accomplish.

Analyzers in Elasticsearch handle tokenization and normalization functions. Elasticsearch offers a variety of ready-made analyzers for common scenarios, as well as language-specific analyzers for English, German, Spanish, French, Hindi, and so on.

In this recipe, we will see how to configure the standard analyzer with the English stopwords filter.

Getting ready

Make sure that you completed the Adding data from the Elasticsearch client recipe. Also, make sure to download the following sample Python script from the GitHub repository: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_analyzer.py.

The command snippets of this recipe are available at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#using-analyzer.

How to do it…

In this recipe, you will learn how to configure your Python code to interface with an Elasticsearch cluster, define a custom English text analyzer, create a new index with the analyzer, and verify that the index uses the specified settings.

Let’s look at the provided Python script:

  1. At the beginning of the script, we create an instance of the Elasticsearch client:
    es = Elasticsearch(
        cloud_id=ES_CID,
        basic_auth=(ES_USER, ES_PWD)
    )
  2. To ensure that we do not use an existing movies index, the script includes code that deletes any such index:
    if es.indices.exists(index="movies"):
        print("Deleting existing movies index...")
        es.options(ignore_status=[404, 400]).indices.delete(index="movies")
  3. Next, we define the analyzer configuration:
    index_settings = {
        "analysis": {
            "analyzer": {
                "standard_with_english_stopwords": {
                    "type": "standard",
                    "stopwords": "_english_"
                }
            }
        }
    }
  4. We then create the index with settings that define the analyzer:
    es.indices.create(index='movies', settings=index_settings)
  5. Finally, to verify the successful addition of the analyzer, we retrieve the settings:
    settings = es.indices.get_settings(index='movies')
    analyzer_settings = settings['movies']['settings']['index']['analysis']
    print(f"Analyzer used for the index: {analyzer_settings}")
  6. After reviewing the script, execute it with the following command, and you should see the output shown in Figure 2.10:
    $ python sampledata_analyzer.py
Figure 2.10 – The output of the sampledata_analyzer.py script

Figure 2.10 – The output of the sampledata_analyzer.py script

Alternatively, you can go to Kibana | Dev Tools and issue the following request:

GET /movies/_settings

In the response, you should see the settings currently applied to the movies index with the configured analyzer, as shown in Figure 2.11:

Figure 2.11 – The analyzer configuration in the index settings

Figure 2.11 – The analyzer configuration in the index settings

How it works...

The settings block of the index configuration is where the analyzer is set. As we are modifying the built-in standard analyzer in our recipe, we will give it a unique name (standard_with_english_stopwords) and set the type to standard. Text indexed from this point will undergo analysis by the modified analyzer. To test this, we can use the _analyze endpoint on the index:

POST movies/_analyze
{
  "text": "A young couple decides to elope.",
  "analyzer": "standard_with_stopwords"
}

It should yield the results shown in Figure 2.12:

Figure 2.12 – The index result of a text with the stopword analyzer

Figure 2.12 – The index result of a text with the stopword analyzer

There’s more…

While Elasticsearch offers many built-in analyzers for different languages and text types, you can also define custom analyzers. These allow you to specify how text is broken down and modified for indexing or searching, using components such as tokenizers, token filters, and character filters – either those provided by Elasticsearch or custom ones you create. For example, you can design an analyzer that converts text to lowercase, removes common words, substitutes synonyms, and strips accents.

Reasons for needing a custom analyzer may include the following:

  • Handling various languages and scripts that require special processing, such as Chinese, Japanese, and Arabic
  • Enhancing the relevance and comprehensiveness of search results using synonyms, stemming, lemmatization, and so on
  • Unifying text by removing punctuation, whitespace, and accents and making it case-insensitive

Defining index mapping

In Elasticsearch, mapping refers to the process of defining the schema or structure of an index. It defines how documents and their fields are stored and indexed within Elasticsearch. Mapping allows you to specify the data type of each field, such as text, a keyword, a numeric character, and a date, and configure various properties for each field, including indexing options and analyzers. By defining a mapping, you provide Elasticsearch with crucial information about the data you intend to index, enabling it to efficiently store, search, and analyze the documents.

Mapping plays a critical role in delivering precise search results, efficient data storage, and effective handling of different data types within Elasticsearch.

When no mapping is predefined, Elasticsearch attempts to dynamically infer data types and create the mapping; this is what has occurred with our movie dataset thus far.

In this recipe, we will apply an explicit mapping to the movies index.

Getting ready

Make sure that you have completed the Updating data in Elasticsearch recipe.

All the command snippets for the Dev Tools in this recipe are available at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#defining-index-mapping.

How to do it…

You can define mappings during index creation or update them in an existing index.

An important note on mappings

When updating the mapping of an existing index that already contains documents, the mapping of those existing documents will not change. The new mapping will only apply to documents indexed afterward.

In this recipe, you are going to create a new index with explicit mapping, and then re-index the data from the movie index, assuming that you have already created that index beforehand:

  1. Head to Kibana | Dev Tools.
  2. Next, let’s check the mapping of the previously created index with the following command:
    GET /movies/_mapping

    You will get the results shown in the following figure. Note that, for readability, some fields were collapsed.

Figure 2.13 – The default mapping on the movies index

Figure 2.13 – The default mapping on the movies index

Let’s review what’s going on in the figure:

a. Examining the current mapping of the genre field reveals a multi-field mapping technique. This approach allows a single field to be indexed in several ways to serve different purposes. For example, the genre field is indexed both as a text field for full-text search and as a keyword field for sorting and aggregation. This dual approach to mapping the genre field is actually beneficial and well-suited for its intended use cases.

b. Examining the release_year field reveals that indexing it as a text field is not optimal, since it represents numerical data, which could be beneficial for range queries, as well as other numeric-specific operations. Retaining the keyword mapping for this field is advantageous for sorting and aggregation purposes. To address this, applying an explicit mapping to treat release_year appropriately as a numerical field is the next step.

c. There are two other fields that will require mapping adjustments – plot and cast. Given their nature, these fields should be indexed solely as text, considering it is unlikely there will be a need to sort or aggregate on these fields. However, this indexing strategy still allows for effective searching against them.

  1. Now, let’s create a new index with the correct explicit mapping for the cast, plot, and release_year fields:
    PUT movies-with-explicit-mapping
    {
      "mappings": {
        "properties": {
          "release_year": {
            "type": "short",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "cast": {
            "type": "text"
          },
          "plot": {
            "type": "text"
          }
        }
      }
    }
  2. Next, reindex the original data in the new index so that the new mapping is applied:
    POST /_reindex
    {
      "source": {
        "index": "movies"
      },
      "dest": {
        "index": "movies-with-explicit-mapping"
      }
    }
  3. Check whether the new mapping has been applied to the new index:
    GET movies-with-explicit-mapping/_mapping

    Figure 2.14 shows the explicit mapping applied to the index:

Figure 2.14 – Explicit mapping

Figure 2.14 – Explicit mapping

How it works...

Explicit mapping in Elasticsearch allows you to define the schema or mapping for your index explicitly. Instead of relying on dynamic mapping, which automatically detects and creates the mapping based on the first indexed document, explicit mapping gives you full control over the data types, field properties, and analysis settings for each field in your index, as shown in Figure 2.15:

Figure 2.15 – The field mapping options

Figure 2.15 – The field mapping options

There’s more…

Mapping is a key aspect of data modeling in Elasticsearch. Avoid relying on dynamic mapping and try, when possible, to explicitly define your mappings to have better control over the field types, properties, and analysis settings. This helps maintain consistency and avoids unexpected field mappings.

You should consider using multi-field mapping to index the same field in different ways, depending on the use cases. For instance, for a full-text search of a string field, text mapping is necessary. If the same string field is mostly used for aggregations, filtering, or sorting, then mapping it to a keyword field is more efficient. Also, consider using mapping limit settings to prevent a mapping explosion (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-settings-limit.html). A situation where every new ingested document introduces new fields, as with dynamic mapping, can result in defining too many fields in an index. This can cause a mapping explosion. When each new field is continually added to the index mapping, it can grow excessively and lead to memory shortages and recovery challenges.

When it comes to mapping limit settings, there are several best practices to keep in mind. First, limit the number of field mappings to prevent documents from causing a mapping explosion. Second, limit the maximum depth of a field. Third, restrict the number of different nested mappings an index can have. Fourth, set a maximum for the count of nested JSON objects allowed in a single document, across all nested types. Finally, limit the maximum length of a field name. Keep in mind that setting higher limits can affect performance and cause memory problems.

For many years now, Elastic has been developing a specification called Elastic Common Schema (ECS) that provides a consistent and customizable way to structure data in Elasticsearch. Adopting this mapping has a lot of benefits (data correlation, reuse, and future-proofing, to name a few), and as a best practice, always refer to the ECS convention when you consider naming your fields. We will see more examples using ECS in the next chapters.

See also

Using dynamic templates in document mapping

In this recipe, we will explore how to leverage dynamic templates in Elasticsearch to automatically apply mapping rules to fields, based on their data types. Elasticsearch allows you to define dynamic templates that simplify the mapping process by dynamically applying mappings to new fields as they are indexed.

Getting ready

Make sure that you have completed the previous recipes:

  • Using an analyzer
  • Defining index mapping

The snippets of the recipe are available at this address: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#using-dynamic-templates-in-document-mapping.

How to do it…

  1. In our example, the default mapping of the year field is set to the long field type, which is suboptimal for storage. We also want to prepare the document mapping so that if additional year fields such as review_year and award_year are introduced, they will have a dynamically applied mapping. Let’s go to Kibana | Dev Tools, where we can extend the previous mapping as follows:
    PUT movies/_mapping
    {
      "dynamic_templates": [{
        "years_as_short": {
          "match_mapping_type": "long",
            "match": "*year",
              "mapping": {
                "type": "short"
              }
        }
      }]
    }
  2. Next, we ingest a new document with a review_year field using the following command:
    POST movies/_doc/
    {
      "review_year": 1993,
      "release_year": 1992,
      "title": "Reservoir Dogs",
      "origin": "American",
      "director": "Quentin Tarantino",
      "cast": "Harvey Keitel, Tim Roth, Steve Buscemi, Chris Penn, Michael Madsen, Lawrence Tierney",
      "genre": "crime drama",
      "wiki_page": "https://en.wikipedia.org/wiki/Reservoir_Dogs",
      "plot": "a group of criminals whose planned diamond robbery goes disastrously wrong, leading to intense suspicion and betrayal within their ranks."
    }
  3. We can now check the mapping with the following command, and we can see that the movies mapping now contains the dynamic template, and the review_year field correctly maps to short, as shown in Figure 2.16.
    GET /movies/_mapping
Figure 2.16 – Updated mapping for the movies index with a dynamic template

Figure 2.16 – Updated mapping for the movies index with a dynamic template

How it works...

In our example for the years_as_short dynamic template, we configured custom mapping as follows:

  • The match_mapping_type parameter is used to define the data type to be detected. In our example, we try to define the data type for long values.
  • The match parameter is used to define the wildcard for the filename ending with year. It uses a pattern to match the field name. (It is also possible to use the unmatch parameter, which uses one or more patterns to exclude fields matched by match.)
  • mapping is used to define the mapping the match field should use. In our example, we map the target field type to short.

There’s more…

Apart from the example that we have seen in this recipe, dynamic templates can also be used in the following scenarios:

  • Only with a match_mapping_type parameter that applies to all the fields of a single type, without needing to match the field name
  • With patch_match or patch_unmatch for a full dotted patch to the field such as "path_match": "myfield_prefix.*" or "path_unmatch": "*.year".

For timestamped data, it is common to have many numeric fields such as metrics. In such cases, filtering on those fields is rarely required and only aggregation is useful. Therefore, it is recommended to disable indexing on those fields to save disk space. You can find a concrete example in the following documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html#_time_series.

The default dynamic field mapping in Elasticsearch is convenient to get started, but it is beneficial to consider defining field mappings more strategically to optimize storage, memory, and indexing/search speed. The workflow to design new index mappings can be as follows:

  1. Index a sample document containing the desired fields in a dummy index.
  2. Retrieve the dynamic mapping created by Elasticsearch.
  3. Modify and optimize the mapping definition.
  4. Create your index with the custom mapping, either explicit or dynamic.

See also

There are some more resources in Elastic’s official documentation, such as the following:

Creating an index template

In this recipe, we will explore how to use index templates in Elasticsearch to define mappings, settings, and other configurations for new indices. Index templates automate the index creation process and ensure consistency across your Elasticsearch cluster.

Getting ready

Before we begin, familiarize yourself with creating component and index templates by using Kibana Dev Tools as explained in this documentation:

Make sure that you have completed the previous recipes:

  • Using an analyzer
  • Defining index mapping

All the commands for the Dev Tools in this recipe are available at this address: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#creating-an-index-template.

How to do it…

In this recipe, we will create two component templates – one for the genre field and another for *year fields with dynamic mapping – and then combine them in an index template:

  1. Create the first component template for the genre field:
    PUT _component_template/movie-static-mapping
    {
      "template": {
        "mappings": {
          "properties": {
            "genre": {
              "type": "keyword"
            }
          }
        }
      }
    }
  2. Create the second component template for the dynamic *year field:
    PUT _component_template/movie-dynamic-mapping
    {
      "template": {
        "mappings": {
          "dynamic_templates": [{
            "years_as_short": {
              "match_mapping_type": "long",
              "match": "*year",
              "mapping": {
                "type": "short"
              }
            }
          }]
        }
      }
    }
  3. Create the index template, which consists of the component templates that we just created; additionally, we define an explicit mapping director field directly in the index template:
    PUT _index_template/movie-template
    {
      "index_patterns": ["movie*"],
      "template": {
        "settings": {
          "number_of_shards": 1
        },
        "mappings": {
          "_source": {
            "enabled": true
          },
          "properties": {
            "director": {
            "type": "keyword"
            }
          }
        },
        "aliases": {
          "mydata": { }
        }
      },
      "priority": 500,
      "composed_of": ["movie-static-mapping", "movie-dynamic-mapping"],
      "version": 1,
      "_meta": {
        "description": "movie template"
      }
    }
  4. Now, we can index another new movie with a field called award_year, as follows:
    POST movies/_doc/
    {
      "award_year": 1998,
      "release_year": 1997,
      "title": "Titanic",
      "origin": "American",
      "director": "James Cameron",
      "cast": "Leonardo DiCaprio, Kate Winslet, Billy Zane, Frances Fisher, Victor Garber, Kathy Bates, Bill Paxton, Gloria Stuart, David Warner, Suzy Amis",
      "genre": "historical epic",
      "wiki_page": "https://en.wikipedia.org/wiki/Titanic_(1997_film)",
      "plot": "The ill-fated maiden voyage of the RMS Titanic, centering on a love story between a wealthy young woman and a poor artist aboard the luxurious, ill-fated R.M.S. Titanic"
    }
  5. Let’s check the mapping after the document ingestion with the following command:
    GET /movies/_mapping
  6. Note the updated mapping, as illustrated in Figure 2.17, with award_year dynamically mapped to short. Additionally, both the genre and director fields are mapped to keyword, thanks to our field definitions in the movie-static-mapping component template and the movie-template index template.
Figure 2.17 – The updated mapping for the movies index

Figure 2.17 – The updated mapping for the movies index

How it works...

Index templates include various configuration settings, such as shard and replica initialization parameters, mapping configurations, and aliases. They also allow you to assign priorities to templates, with a default priority of 100.

Component templates act as building blocks for index templates, which can comprise settings, aliases, or mappings and can be combined in an index template, using the composed_of parameter.

Legacy index templates were deprecated upon the release of Elasticsearch 7.8.

Figure 2.18 gives you an overview of the relationship between index templates, component templates, and legacy templates:

Figure 2.18 – Index templates versus legacy index templates

Figure 2.18 – Index templates versus legacy index templates

There’s more…

Elasticsearch provides predefined index templates that are associated with index and data stream patterns (you can find more details in Chapter 4), such as logs-*-*, metrics-*-*, and synthetics-*-*, with a default priority of 100. If you wish to create custom index templates that override the predefined ones but still use the same patterns, you can assign a priority value higher than 100. If you want to disable the built-in index and component templates altogether, you can set the stack.templates.enabled configuration parameter to false; the detailed documentation can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html.

Indexing multiple documents using Bulk API

In this recipe, we will explore how to use the Elasticsearch client to ingest an entire movie dataset using the bulk API. We will also integrate various concepts we have covered in previous recipes, specifically related to mappings, to ensure that the correct mappings are applied to our index.

Getting ready

For this recipe, we will work with the sample Wikipedia Movie Plots dataset introduced at the beginning of the chapter. The file is accessible in the GitHub repository via this URL: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/dataset/wiki_movie_plots_deduped.csv.

Make sure that you have completed the previous recipes:

  • Using an analyzer
  • Creating index template

How to do it…

Head to the GitHub repository to download the Python script at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_bulk.py and then follow these steps:

  1. Update the .env file with the MOVIE_DATASET variable, which specifies the path to the downloaded movie dataset CSV file:
    MOVIE_DATASET=<the-path-of-the-csv-file>
  2. Once the .env file is correctly configured, run the sampledata_bulk.py Python script. During execution, you should see output similar to the following (note that, for readability, the output image has been truncated):
Figure 2.19 – The output of the sampledata_bulk.py script

Figure 2.19 – The output of the sampledata_bulk.py script

  1. To verify that the new movies index has the appropriate mappings, head to Kibana | Dev Tools and execute the following command:
    GET /movies/_mapping
Figure 2.20 – The movies index with a new mapping

Figure 2.20 – The movies index with a new mapping

  1. As illustrated in Figure 2.20, dynamic mapping on the release_year field was applied to the newly created movies index, despite a mapping being explicitly specified in the script. This occurred because an index template was defined in the Using dynamic templates in document mapping recipe, with the index pattern set to movies*. As a result, any index that matches this pattern will automatically inherit the settings from the template, including its dynamic mapping configuration.
  2. Next, to verify that the entire dataset has been indexed, execute the following command:
    GET /movies/_count

    The command should produce the output illustrated in Figure 2.21. According to this output, your movies index should contain 34,886 documents:

Figure 2.21 – A count of the documents in bulk-indexed movies

Figure 2.21 – A count of the documents in bulk-indexed movies

We have just set up an index with the right explicit mapping and loaded an entire dataset by using the Elasticsearch Python client.

How it works...

The script we’ve provided contains several sections. First, we delete any existing movies indexes to make sure we start from a clean slate. This is the reason you did not see the award_year and review_year fields in the new mapping shown in Figure 2.20. We then use the create_index method to create the movies index and specify the settings and the mappings we wish to apply to the documents that will be stored in this index.

Then, there is the generate_actions function that yields a document for each row in our CSV dataset. This function is then used by the streaming_bulk helper method.

The streaming_bulk helper function in the Elasticsearch Python client is used to perform bulk indexing of documents in Elasticsearch. It is like the bulk helper function, but it is designed to handle large datasets.

The streaming_bulk function accepts an iterable of documents and sends them to Elasticsearch in small batches. This strategy allows you to efficiently process substantial datasets without exhausting system memory.

There’s more…

The Elasticsearch Python Client provides several helper functions for the bulk API, which can be challenging to use directly because of its specific formatting requirements and other considerations. These helpers accept an instance of the es class and an iterable action, which can be any iterable or generator.

The most common format for the iterable action is the same as that returned by the search() method. The bulk() API accepts the index, create, delete, and update actions. The _op_type field is used to specify an action, with _op_type defaulting to index. There are several bulk helpers available, including bulk(), parallel_bulk(), streaming_bulk(), and bulk_index(). The following table outlines these helpers and their preferred use cases:

Bulk helper functions

Use cases

bulk()

This helper is used to perform bulk operations on a single thread. It is ideal for small- to medium-sized datasets and is the simplest of the bulk helpers.

parallel_bulk()

This helper is used to perform bulk operations on multiple threads. It is ideal for large datasets and can significantly improve indexing performance.

streaming_bulk()

This helper is used to perform bulk operations on a large dataset that cannot fit into memory. It is ideal for large datasets and can be used to stream data from a file or other source.

bulk_index()

This helper is used to perform bulk indexing operations on a large dataset that cannot fit into memory. It is ideal for large datasets and can be used to stream data from a file or other source.

Table 2.1 – Bulk helper functions and their associated use cases

See also

If you are interested in more examples of bulk ingest using the Python client, check out the official Python client repository: https://github.com/elastic/elasticsearch-py/tree/main/examples/bulk-ingest.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Explore the diverse capabilities of the Elastic Stack through a comprehensive set of recipes
  • Build search applications, analyze your data, and observe cloud-native applications
  • Harness powerful machine learning and AI features to create data science and search applications
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Learn how to make the most of the Elastic Stack (ELK Stack) products—including Elasticsearch, Kibana, Elastic Agent, and Logstash—to take data reliably and securely from any source, in any format, and then search, analyze, and visualize it in real-time. This cookbook takes a practical approach to unlocking the full potential of Elastic Stack through detailed recipes step by step. Starting with installing and ingesting data using Elastic Agent and Beats, this book guides you through data transformation and enrichment with various Elastic components and explores the latest advancements in search applications, including semantic search and Generative AI. You'll then visualize and explore your data and create dashboards using Kibana. As you progress, you'll advance your skills with machine learning for data science, get to grips with natural language processing, and discover the power of vector search. The book covers Elastic Observability use cases for log, infrastructure, and synthetics monitoring, along with essential strategies for securing the Elastic Stack. Finally, you'll gain expertise in Elastic Stack operations to effectively monitor and manage your system.

Who is this book for?

This book is for Elastic Stack users, developers, observability practitioners, and data professionals ranging from beginner to expert level. If you’re a developer, you’ll benefit from the easy-to-follow recipes for using APIs and features to build powerful applications, and if you’re an observability practitioner, this book will help you with use cases covering APM, Kubernetes, and cloud monitoring. For data engineers and AI enthusiasts, the book covers dedicated recipes on vector search and machine learning. No prior knowledge of the Elastic Stack is required.

What you will learn

  • Discover techniques for collecting data from diverse sources
  • Visualize data and create dashboards using Kibana to extract business insights
  • Explore machine learning, vector search, and AI capabilities of Elastic Stack
  • Handle data transformation and data formatting
  • Build search solutions from the ingested data
  • Leverage data science tools for in-depth data exploration
  • Monitor and manage your system with Elastic Stack

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jun 28, 2024
Length: 688 pages
Edition : 1st
Language : English
ISBN-13 : 9781837633500
Category :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Product Details

Publication date : Jun 28, 2024
Length: 688 pages
Edition : 1st
Language : English
ISBN-13 : 9781837633500
Category :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 156.97
Elastic Stack 8.x Cookbook
$51.99
API Testing and Development with Postman
$49.99
Mastering Ubuntu Server
$54.99
Total $ 156.97 Stars icon
Banner background image

Table of Contents

15 Chapters
Chapter 1: Getting Started – Installing the Elastic Stack Chevron down icon Chevron up icon
Chapter 2: Ingesting General Content Data Chevron down icon Chevron up icon
Chapter 3: Building Search Applications Chevron down icon Chevron up icon
Chapter 4: Timestamped Data Ingestion Chevron down icon Chevron up icon
Chapter 5: Transform Data Chevron down icon Chevron up icon
Chapter 6: Visualize and Explore Data Chevron down icon Chevron up icon
Chapter 7: Alerting and Anomaly Detection Chevron down icon Chevron up icon
Chapter 8: Advanced Data Analysis and Processing Chevron down icon Chevron up icon
Chapter 9: Vector Search and Generative AI Integration Chevron down icon Chevron up icon
Chapter 10: Elastic Observability Solution Chevron down icon Chevron up icon
Chapter 11: Managing Access Control Chevron down icon Chevron up icon
Chapter 12: Elastic Stack Operation Chevron down icon Chevron up icon
Chapter 13: Elastic Stack Monitoring Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Full star icon 5
(3 Ratings)
5 star 100%
4 star 0%
3 star 0%
2 star 0%
1 star 0%
Petuniadontics Aug 19, 2024
Full star icon Full star icon Full star icon Full star icon Full star icon 5
The 'Elastic Stack 8.x Cookbook' has been an incredibly helpful guide for me. The way it’s laid out, with clear, step-by-step recipes, makes it so easy to jump in and start applying what you learn to real-world projects. Each chapter feels like it naturally builds on the last, helping you really get a grip on what Elastic Stack can do. And for those who are more experienced, the sections on machine learning and AI offer fresh, exciting ways to push your skills further. I would say whether you’re just starting out or already deep into data analytics this book is a must-have. I highly recommend it!
Amazon Verified review Amazon
GoogleGuy Sep 16, 2024
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book is a comprehensive yet accessible guide, packed with insights that are both informative and incredibly practical. From the moment I started reading, I was captivated by the clear explanations and step-by-step instructions that make even the most complex concepts easy to grasp. What truly sets this book apart is its real-world examples. Whether you’re a beginner or someone looking to deepen your understanding, this book provides value at every level.
Amazon Verified review Amazon
Patrice Palau Sep 14, 2024
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This cookbook is an extremely well structured and to the point set of recipes, written by actual Elastic experts. It covers a wide range of very practical topics, going through the basics, like installing the Elastic stack, understanding general data ingestion and writing a search application, all the way to more advanced topics, like data visualization, data analysis, generative AI, and more. All recipes are divided into similar sections (getting ready, how to do it, how it works, etc.) which makes the book very easy to navigate.Highly recommended for anyone in search of a hands-on source of knowledge on Elastic.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.