Basic operations with Elasticsearch
We have already seen how Elasticsearch stores data and provides REST APIs to perform operations. In the next few sections, we will perform some basic actions using the command-line tool cURL. Once you have grasped the basics, you will start programming and implementing these concepts using Python and Java in the upcoming chapters.
Note
When we create an index, Elasticsearch by default creates five shards and one replica for each shard (this means five primary and five replica shards). This setting can be controlled in the elasticsearch.yml file by changing the index.number_of_shards and index.number_of_replicas settings, or it can also be provided while creating the index.
Once the index is created, the number of shards can't be increased or decreased; however, you can increase or decrease the number of replicas at any time after index creation. So it is better to choose the number of required shards for an index at the time of index creation.
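For example, a request along the following lines creates an index with three primary shards and two replicas per shard (the index name my_index and the shard counts here are just an illustration):
curl -XPUT 'localhost:9200/my_index/' -d '{ "settings": { "index.number_of_shards": 3, "index.number_of_replicas": 2 } }'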
Creating an index
Let's begin by creating our first index and giving it a name, which is books in this case. After executing the following command, an index with five shards and one replica will be created:
curl -XPUT 'localhost:9200/books/'
Tip
Uppercase letters and blank spaces are not allowed in index names.
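For instance, an index creation request like the following one, whose name contains an uppercase letter (the name Books is used here purely for illustration), will be rejected by Elasticsearch with an error:
curl -XPUT 'localhost:9200/Books/'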
Indexing a document in Elasticsearch
Similar to all databases, Elasticsearch has the concept of a unique identifier for each document, which is known as _id. This identifier can be created in two ways: either you provide your own unique ID while indexing the data, or, if you don't provide any ID, Elasticsearch creates a default ID for that document. The following are some examples:
curl -XPUT 'localhost:9200/books/elasticsearch/1' -d '{ "name":"Elasticsearch Essentials", "author":"Bharvi Dixit", "tags":["Data Analytics","Text Search","Elasticsearch"], "content":"Added with PUT request" }'
On executing the above command, Elasticsearch will give the following response:
{"_index":"books","_type":"elasticsearch","_id":"1","_version":1,"created":true}
However, if you do not provide an ID (which is 1 in our case), then you will get the following error:
No handler found for uri [/books/elasticsearch] and method [PUT]
The reason behind the preceding error is that we are using a PUT request to create a document. However, Elasticsearch has no idea where to store this document (no existing URI for the document is available).
If you want the _id to be auto-generated, you have to use a POST request. For example:
curl -XPOST 'localhost:9200/books/elasticsearch' -d '{ "name":"Elasticsearch Essentials", "author":"Bharvi Dixit", "tags":["Data Analytics","Text Search","Elasticsearch"], "content":"Added with POST request" }'
The response from the preceding request will be as follows:
{"_index":"books","_type":"elasticsearch","_id":"AU-ityC8xdEEi6V7cMV5","_version":1,"created":true}
If you open the localhost:9200/_plugin/head URL, you can perform all the CRUD operations using the HEAD plugin as well:
Some of the stats that you can see in the preceding image are these:
- Cluster name: elasticsearch_cluster
- Node name: node-1
- Index name: books
- No. of primary shards: 5
- No. of docs in the index: 2
- No. of unassigned shards (replica shards): 5
Note
Cluster states in Elasticsearch
An Elasticsearch cluster can be in one of the three states: GREEN, YELLOW, or RED. If all the shards, meaning primary as well as replicas, are assigned in the cluster, it will be in the GREEN state. If any one of the replica shards is not assigned because of any problem, then the cluster will be in the YELLOW state. If any one of the primary shards is not assigned on a node, then the cluster will be in the RED state. We will see more on these states in the upcoming chapters. Elasticsearch never assigns a primary and its replica shard on the same node.
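You can check the current state of your cluster at any time with the cluster health API, for example:
curl -XGET 'localhost:9200/_cluster/health?pretty'
The status field in the response will be green, yellow, or red, depending on the shard allocation described above.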
Fetching documents
We have stored documents in Elasticsearch. Now we can fetch them using their unique IDs with a simple GET request.
Getting a complete document
We have already indexed our document. Now, we can get the document using its document identifier by executing the following command:
curl -XGET 'localhost:9200/books/elasticsearch/1?pretty'
The output of the preceding command is as follows:
{ "_index" : "books", "_type" : "elasticsearch", "_id" : "1", "_version" : 1, "found" : true, "_source":{"name":"Elasticsearch Essentials","author":"Bharvi Dixit", "tags":["Data Anlytics","Text Search","ELasticsearch"],"content":"Added with PUT request"} }
Note
pretty is used in the preceding request to make the response nicer and more readable.
As you can see, there is a _source field in the response. This is a special field reserved by Elasticsearch to store all the JSON data. There are options available to not store the data in this field, since it comes with an extra disk space requirement. However, this field also helps in many ways while returning data from ES, re-indexing data, or doing partial document updates. We will see more on this field in the next chapters.
If the document did not exist in the index, the found field would have been marked as false.
Getting part of a document
Sometimes you need only some of the fields to be returned instead of returning the complete document. For these scenarios, you can send the names of the fields to be returned inside the _source parameter with the GET request:
curl -XGET 'localhost:9200/books/elasticsearch/1?_source=name,author'
The response of Elasticsearch will be as follows:
{ "_index":"books", "_type":"elasticsearch", "_id":"1", "_version":1, "found":true, "_source":{"author":"Bharvi Dixit","name":"Elasticsearch Essentials"} }
Updating documents
It is possible to update documents in Elasticsearch, which can be done either completely or partially, but updates come with some limitations and costs. In the next sections, we will see how these operations can be performed and how things work behind the scenes.
Updating a whole document
To update a whole document, you can use the same kind of PUT/POST request that we used to create a new document:
curl -XPUT 'localhost:9200/books/elasticsearch/1' -d '{ "name":"Elasticsearch Essentials", "author":"Bharvi Dixit", "tags":["Data Analytics","Text Search","Elasticsearch"], "content":"Updated document", "publisher":"pact-pub" }'
The response of Elasticsearch looks like this:
{"_index":"books","_type":"elasticsearch","_id":"1","_version":2,"created":false}
If you look at the response, it shows that _version is 2 and created is false, meaning the document was updated.
Updating documents partially
Instead of updating the whole document, we can use the _update API to do partial updates. As shown in the following example, we will add a new field, updated_time, to the document, for which a script parameter has been used. Elasticsearch uses Groovy scripting by default.
Note
Scripting is disabled by default in Elasticsearch, so to use a script you need to enable it by adding the following parameter to your elasticsearch.yml file:
script.inline: on
curl -XPOST 'localhost:9200/books/elasticsearch/1/_update' -d '{ "script" : "ctx._source.updated_time= \"2015-09-09T00:00:00\"" }'
The response of the preceding request will be this:
{"_index":"books","_type":"elasticsearch","_id":"1","_version":3}
It shows that a new version has been created in Elasticsearch.
Elasticsearch stores data in indexes that are composed of Lucene segments. These segments are immutable in nature, meaning that, once created, they can't be changed. So, when we send an update request to Elasticsearch, it does the following things in the background:
- Fetches the JSON data from the _source field for that document
- Makes the changes in the _source field
- Deletes the old document
- Creates a new document
All these data re-indexing tasks can be done by the user; however, if you use the _update API, it is all done with a single request. The process is the same for a whole document update as for a partial update. The benefit of a partial update is that all operations are done within a single shard, which avoids network overhead.
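Note that for simple changes such as adding or overwriting a field, the _update API also accepts a doc parameter, which avoids scripting altogether. A minimal sketch of the same update without a script looks like this:
curl -XPOST 'localhost:9200/books/elasticsearch/1/_update' -d '{ "doc": { "updated_time": "2015-09-09T00:00:00" } }'
The fields provided under doc are merged into the existing _source of the document.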
Deleting documents
To delete a document using its identifier, we need to use the DELETE request:
curl -XDELETE 'localhost:9200/books/elasticsearch/1'
The following is the response of Elasticsearch:
{"found":true,"_index":"books","_type":"elasticsearch","_id":"1","_version":4}
If you are from a Lucene background, then you must know how segment merging is done and how new segments are created in the background as more documents get indexed. Whenever we delete a document from Elasticsearch, it does not get deleted from the file system right away. Rather, Elasticsearch just marks the document as deleted, and when segment merging happens during further indexing, the documents marked as deleted are actually removed according to the merge policy. The same process applies when a document is updated.
The space from deleted documents can also be reclaimed with the _optimize API by executing the following command:
curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes=true'
Checking documents' existence
While developing applications, some scenarios require you to check whether a document exists or not in Elasticsearch. In these scenarios, rather than querying the documents with a GET request, you have the option of using another HTTP request method called HEAD:
curl -i -XHEAD 'localhost:9200/books/elasticsearch/1'
The following is the response of the preceding command:
HTTP/1.1 200 OK Content-Type: text/plain; charset=UTF-8 Content-Length: 0
In the preceding command, I have used the -i parameter to show the header information of the HTTP response. It has been used because a HEAD request only returns headers and not any content. If the document is found, the status code will be 200, and if not, it will be 404.
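For example, issuing the same HEAD request against a document ID that does not exist in the index (the ID 99 below is purely illustrative) returns a 404 status in the response headers:
curl -i -XHEAD 'localhost:9200/books/elasticsearch/99'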