Elastic Stack 8.x Cookbook

Ingesting General Content Data

This chapter, along with Chapter 4, will focus on data ingestion. Generally, we can categorize data into two groups – general content (data from APIs, HTML pages, catalogs, relational database management systems (RDBMSs), PDFs, spreadsheets, etc.), and time series (data indexed in chronological order, such as logs, metrics, traces, and security events). In this chapter, we will ingest general content to illustrate the basic concepts of data ingestion, including fundamental data operations (index, delete, and update), analyzers, static and dynamic index mappings, and index templates.

Figure 2.1 illustrates the connections between various components, and in this chapter, we will explore recipes dedicated to the Client APP, Analyzer, Mapping, and Index template components (you can view the color image when you download the free PDF version of this book):

Figure 2.1 – Elasticsearch index management components

In this chapter, we are going to cover the following main topics:

Adding data from the Elasticsearch client
Updating data in Elasticsearch
Deleting data in Elasticsearch
Using an analyzer
Defining index mapping
Using dynamic templates in document mapping
Creating an index template
Indexing multiple documents using Bulk API

Adding data from the Elasticsearch client

To ingest general content such as catalogs, HTML pages, and files from your application, Elastic provides official clients for many programming languages, which ingest data through the Elasticsearch REST APIs. In this recipe, we will learn how to add sample data to Elasticsearch hosted on Elastic Cloud using the Python client.

To use Elasticsearch’s REST APIs through various programming languages, a client application chooses a suitable client library. The client initializes and sends HTTP requests, directing them to the Elasticsearch cluster for data operations. Elasticsearch processes the requests and returns HTTP responses containing results or errors. The client application parses these responses and acts on the data accordingly. Figure 2.2 shows the summarized data flow:

Figure 2.2 – Elasticsearch’s client request and response flow
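To make the flow in Figure 2.2 concrete, the following sketch builds, without sending, the kind of HTTP request a client library issues when it indexes a document. It uses only the Python standard library. The function name, endpoint URL, and credentials are placeholders for illustration, and a real client adds connection handling, retries, and response parsing on top of this.

```python
import base64
import json
import urllib.request


def build_index_request(base_url, index, document, user, password):
    """Build (but do not send) the HTTP request for indexing one document."""
    # Basic authentication is a base64-encoded "user:password" pair.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        # POST /<index>/_doc lets Elasticsearch generate the document ID.
        url=f"{base_url}/{index}/_doc",
        data=json.dumps(document).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder endpoint and credentials, for illustration only.
request = build_index_request(
    "https://localhost:9200", "movies", {"title": "A sample title"},
    "elastic", "changeme",
)
print(request.get_method(), request.full_url)
```

The client library hides these details, which is why the recipe below only needs to pass a Python dictionary to the client.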

Getting ready

To simplify package management, we recommend that you install pip (https://pypi.org/project/pip/).

The snippets of this recipe are available here: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/snippets.md#adding-data-from-the-elasticsearch-client.

How to do it…

First, we will install the Elasticsearch Python client:

Add elasticsearch, elasticsearch-async, and load_dotenv to the requirements.txt file of your Python project (the sample requirements.txt file can be found at this address: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/requirements.txt).
Run the following command to install the Elasticsearch Python client library:
```
$ pip install -r requirements.txt
```
Now, let’s set up a connection to Elasticsearch.
Prepare a .env file to store the access information for basic authentication: the Cloud ID ("ES_CID"), the username ("ES_USER"), and the password ("ES_PWD"). You can find the sample .env file at this address: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/.env.
Remember that we saved the password for our default user, elastic, in the Deploying Elastic Stack on Elastic Cloud recipe in Chapter 1, and the instructions to find the cloud ID can be found in the same recipe.
Import the libraries in a Python file (sampledata_index.py), which we will use for this recipe:
```
import os
import time  # used later in the script to pause before searching
from elasticsearch import Elasticsearch
from dotenv import load_dotenv
```

Load the environment variables and initiate an Elasticsearch connection:

```
load_dotenv()
ES_CID = os.getenv('ES_CID')
ES_USER = os.getenv('ES_USER')
ES_PWD = os.getenv('ES_PWD')
es = Elasticsearch(
    cloud_id=ES_CID,
    basic_auth=(ES_USER, ES_PWD)
)
print(es.info())
```

Now, you can run the script to check whether the connection is successful. Run the following command:
```
$ python sampledata_index.py
```
You should see an output that looks like the following screenshot:

Figure 2.3 – Connected Elasticsearch information
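If the connection fails at this point, one thing worth ruling out is a missing or empty value in the .env file. The following is a minimal sketch of such a check; the helper name is hypothetical and not part of the book's sample code, and it only inspects the three variable names used in this recipe.

```python
import os

REQUIRED_SETTINGS = ("ES_CID", "ES_USER", "ES_PWD")


def read_connection_settings(env=os.environ):
    """Return the three connection settings, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing settings in .env: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

Called right after load_dotenv(), this fails early with a readable message instead of passing None values to the client.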

We can now extend the script to ingest a document. Prepare a sample JSON document from the movie dataset:

```
mymovie = {
    'release_year': '1908',
    'title': 'It is not this day.',
    'origin': 'American',
    'director': 'D.W. Griffith',
    'cast': 'Harry Solter, Linda Arvidson',
    'genre': 'comedy',
    'wiki_page': 'https://en.wikipedia.org/wiki/A_Calamitous_Elopement',
    'plot': 'A young couple decides to elope after being caught in the midst of a romantic moment by the woman.'
}
```

Index the sample data in Elasticsearch. Here, we will choose the index name 'movies' and print the index results. Finally, we will store the document ID in a tmp file that we will reuse for the following recipes:

```
response = es.index(index='movies', document=mymovie)
print(response)
# Write the '_id' to a file named tmp.txt
with open('tmp.txt', 'w') as file:
    file.write(response['_id'])
# Print the contents of the file to confirm it's written correctly
with open('tmp.txt', 'r') as file:
    print(f"document id saved to tmp.txt: {file.read()}")
time.sleep(2)
```

Verify the data in Elasticsearch to ensure that it has been successfully indexed; wait two seconds after the indexing, query Elasticsearch using the _search API, and then print the results:
```
response = es.search(index='movies', query={"match_all": {}})
print("Sample movie data in Elasticsearch:")
for hit in response['hits']['hits']:
    print(hit['_source'])
```
Execute the script again with the following command:
```
$ python sampledata_index.py
```
You should have the following result in the console output:

Figure 2.4 – The output of the sampledata_index.py script

The full code sample can be found at https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_index.py.
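The two-second pause in the script is there because a newly indexed document only becomes searchable after the next index refresh, which by default happens roughly every second. As an alternative sketch, the index call can ask Elasticsearch to wait for that refresh through the refresh parameter of the index API. The helper name below is hypothetical, and es is assumed to be the client created earlier in this recipe.

```python
def index_and_wait(es, index, document):
    """Index a document and return only once it is visible to search."""
    # refresh="wait_for" blocks until a refresh has made the document searchable,
    # so no fixed time.sleep() is needed before the search that follows.
    return es.index(index=index, document=document, refresh="wait_for")
```

This trades a fixed delay for a wait that is tied to the refresh itself; for a one-off test script either approach is reasonable.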

How it works...

In this recipe, we learned how to use the Elastic Python client to securely connect to a hosted deployment on Elastic Cloud.

Because the movies index did not exist yet, Elasticsearch created it automatically when the first document was ingested, and the fields were created with the default mapping.

Later in this chapter, we will learn how to define static and dynamic mapping to customize field types with the help of concrete recipes.

It’s also important to note that as we did not provide a document ID, Elasticsearch automatically generated an ID during the indexing phase as well.
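When the application already has a natural identifier for a document, it can pass that identifier instead of relying on an auto-generated one. The sketch below wraps the same es.index call with an optional id argument; the helper name is hypothetical, and es is assumed to be the client created in this recipe.

```python
def index_movie(es, document, doc_id=None):
    """Index a movie, optionally under a caller-chosen document ID."""
    kwargs = {"index": "movies", "document": document}
    if doc_id is not None:
        # With an explicit ID, indexing the same ID again replaces the document.
        kwargs["id"] = doc_id
    return es.index(**kwargs)
```

Without doc_id, the call is identical to the one used in the recipe and Elasticsearch generates the ID.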

The following diagram (Figure 2.5) shows the index processing flow:

Figure 2.5 – The ingestion flow

There’s more…

In this recipe, we used the HTTP basic authentication method. The Elastic Python client also provides other authentication methods, such as HTTP Bearer authentication and API key authentication. Detailed documentation can be found at the following link: https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#auth-bearer.
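As a small illustration of switching authentication methods, the following sketch builds the keyword arguments for the Elasticsearch constructor from a settings mapping, preferring an API key when one is present. The ES_API_KEY variable name and the helper name are assumptions made for this example; api_key and basic_auth are the client's own constructor parameters.

```python
def connection_kwargs(settings):
    """Build keyword arguments for Elasticsearch(...) from a settings mapping."""
    kwargs = {"cloud_id": settings["ES_CID"]}
    api_key = settings.get("ES_API_KEY")  # hypothetical variable name
    if api_key:
        kwargs["api_key"] = api_key
    else:
        # Fall back to the basic authentication used in this recipe.
        kwargs["basic_auth"] = (settings["ES_USER"], settings["ES_PWD"])
    return kwargs
```

It could be used as es = Elasticsearch(**connection_kwargs(os.environ)) after load_dotenv() has run.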

We chose to illustrate the simplicity of general content data ingestion by using the Python client. Detailed documentation for other client libraries can be found at the following link: https://www.elastic.co/guide/en/elasticsearch/client/index.html

During the development and testing phase, it’s also very useful to call the Elasticsearch REST API directly, either with an HTTP client, such as cURL or Postman, or with the Kibana Dev Tools console (https://www.elastic.co/guide/en/kibana/current/console-kibana.html).

Updating data in Elasticsearch

In this recipe, we will explore how to update data in Elasticsearch using the Python client.

Getting ready

Ensure that you have installed the Elasticsearch Python client and have successfully set up a connection to your Elasticsearch cluster (refer to the Adding data from the Elasticsearch client recipe). You will also need to have completed the previous recipe, which involves ingesting a document into the movies index.

Note

The following three recipes will use the same set of requirements.

How to do it…

In this recipe, we’re going to update the director field of a particular document within the movies index. The director field will be changed from its current value, D.W. Griffith, to a new value, Clint Eastwood. The following are the steps you’ll need to follow in your Python script to perform this update and confirm that it has been successfully applied. Let’s inspect the Python script that we will use to update the ingested document (https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_update.py):

First, we retrieve the ID of the previously ingested document from the tmp.txt file and define the new value for the director field:
```
index_name = 'movies'
document_id = ''
# Read the ID of the document ingested in the previous recipe
with open('tmp.txt', 'r') as file:
    document_id = file.read().strip()
document = {
    'director': 'Clint Eastwood'
}
```

We can now check document_id, verify that the document exists in the index, and then perform the update operation:

```
# Update the document in Elasticsearch if document_id is valid
if document_id != '':
    if es.exists(index=index_name, id=document_id):
        response = es.update(index=index_name, id=document_id,
                             doc=document)
        print(f"Update status: {response['result']}")
```

Once the document is updated, you can verify that the update was successful by retrieving the document from Elasticsearch and printing it:
```
updated_document = es.get(index=index_name, id=document_id)
print("Updated document:")
print(updated_document)
```
After inspecting the script, let’s run it with the following command:
```
$ python sampledata_update.py
```

Figure 2.6 – The output of the sampledata_update.py script

You should see that the _version and director fields are updated.
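The steps above can be folded into one small function that also makes the skipped cases explicit. This is a sketch rather than the book's script: the function name and the "skipped" and "not_found" labels are hypothetical, while the es.exists and es.update calls are the ones used in this recipe. Note that if the field already holds the new value, Elasticsearch reports the result as noop instead of updated.

```python
def update_director(es, index, document_id, new_director):
    """Update the director field and return the result reported by Elasticsearch."""
    document_id = document_id.strip()  # tolerate stray whitespace from tmp.txt
    if not document_id:
        return "skipped"  # hypothetical label: no ID was available
    if not es.exists(index=index, id=document_id):
        return "not_found"  # hypothetical label: the ID is not in the index
    response = es.update(index=index, id=document_id,
                         doc={"director": new_director})
    return response["result"]
```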

How it works...

Each document includes a _version field in Elasticsearch. Elasticsearch documents cannot be modified directly, as they are immutable. When you update an existing document, a new document is generated with an incremented version, while the previous document is flagged for deletion.
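The following toy model illustrates these semantics in plain Python. It is only a mental model of "write a new copy, flag the old one for deletion" and makes no claim about how Elasticsearch stores data internally; all names in it are invented for the illustration.

```python
class ToyIndex:
    """A toy model of update semantics; not how Elasticsearch is implemented."""

    def __init__(self):
        self.live = {}      # document ID -> (version, source)
        self.deleted = []   # superseded copies that are flagged for deletion

    def index(self, doc_id, source):
        """Store a new document with version 1."""
        self.live[doc_id] = (1, dict(source))
        return 1

    def update(self, doc_id, partial):
        """Apply a partial update by writing a new copy with the next version."""
        old_version, old_source = self.live[doc_id]
        # The old copy is never edited in place; it is only flagged for deletion.
        self.deleted.append((doc_id, old_version, old_source))
        merged = {**old_source, **partial}
        self.live[doc_id] = (old_version + 1, merged)
        return old_version + 1


toy = ToyIndex()
toy.index("1", {"title": "Example", "director": "D.W. Griffith"})
new_version = toy.update("1", {"director": "Clint Eastwood"})
print(new_version, toy.live["1"][1]["director"], len(toy.deleted))
```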

There’s more…

We have just seen how to update a single document in Elasticsearch. When many documents need the same change, updating them one at a time is not optimal from a performance point of view. To update multiple documents that match a specific query, you can use the Update By Query API. This allows you to define a query to select the documents you want to update and specify the changes to be made; here is an example of how to do it with the Python client:

```
q = {
    "script": {
        "source": "ctx._source.genre = 'comedies'",
        "lang": "painless"
    },
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "genre": "comedy"
                    }
                }
            ]
        }
    }
}
es.update_by_query(body=q, index=index_name)
```

The full Python script is available here: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter2/python-client-sample/sampledata_update_by_query.py.
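A variation on the same idea is to pass the new value through script parameters instead of embedding it in the script source, and to read the counters that the API returns. The sketch below assumes a client version that accepts query and script as keyword arguments, as the 8.x client does; the function name is hypothetical.

```python
def rename_genre(es, index, old_genre, new_genre):
    """Rewrite the genre field on every document whose genre matches old_genre."""
    response = es.update_by_query(
        index=index,
        query={"term": {"genre": old_genre}},
        script={
            "lang": "painless",
            # The value travels in params, so the script source stays constant.
            "source": "ctx._source.genre = params.new_genre",
            "params": {"new_genre": new_genre},
        },
    )
    # "updated" counts modified documents; "total" counts documents that matched.
    return response["updated"], response["total"]
```

Calling rename_genre(es, index_name, "comedy", "comedies") would apply the same change as the example above.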

Note

The script used here is written in Painless, the Elasticsearch scripting language; we will see more examples in Chapter 6.

The other way to update multiple documents in a single request is via Elasticsearch’s Bulk API. The Bulk API can be used to insert, update, and delete multiple documents efficiently. We will learn how to use the Bulk API to ingest multiple documents at the end of this chapter. For more detailed information, refer to the following documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html.
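As a preview, the Bulk API expects a newline-delimited JSON body in which each update is an action line followed by a line holding the partial document. The following sketch builds such a body with the standard library only; the function name and the document IDs are made up for the example.

```python
import json


def bulk_update_body(index, updates):
    """Serialize {document ID: partial document} into a Bulk API request body."""
    lines = []
    for doc_id, partial in updates.items():
        # Action line: which operation, on which index and document.
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        # Payload line: the fields to merge into the existing document.
        lines.append(json.dumps({"doc": partial}))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_update_body(
    "movies", {"id-1": {"genre": "comedy"}, "id-2": {"genre": "drama"}}
)
print(body)
```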

To retrieve the ID of the document we want to update, we rely on a tmp.txt file where the ID of a previously created document was saved. Alternatively, you can retrieve the document’s ID by searching the movies index from Kibana: go to Kibana | Dev Tools and execute the following command:

GET movies/_search

This query should return a list of hits that display all documents in the index, along with their respective IDs, as shown in Figure 2.7. Using these results, locate and record the ID of the document you would like to update:

Figure 2.7 – Checking the document ID
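The same lookup can be done in Python by reading the _id of each hit in a search response. The sketch below works on a plain dictionary in the shape returned by the _search API; the sample response is trimmed and made up, and the function name is hypothetical.

```python
def ids_and_titles(search_response):
    """Return (document ID, title) pairs from a _search response body."""
    return [
        (hit["_id"], hit["_source"].get("title"))
        for hit in search_response["hits"]["hits"]
    ]


# A trimmed, made-up response in the shape returned by GET movies/_search.
sample = {
    "hits": {
        "hits": [
            {"_index": "movies", "_id": "example-id-1",
             "_source": {"title": "A sample title"}},
        ]
    }
}
print(ids_and_titles(sample))
```

With the Python client, the result of es.search(index='movies', query={"match_all": {}}) can be passed to the same function.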
