Basic operations with Elasticsearch
We have already seen how Elasticsearch stores data and provides REST APIs to perform operations. In the next few sections, we will perform some basic actions using the command-line tool cURL. Once you have grasped the basics, you will start programming and implementing these concepts using Python and Java in the upcoming chapters.
Note
When we create an index, Elasticsearch by default creates five shards and one replica for each shard (this means five primary and five replica shards). This setting can be controlled in the elasticsearch.yml file by changing the index.number_of_shards and index.number_of_replicas settings, or it can also be provided while creating the index.
Once the index is created, the number of shards can't be increased or decreased; however, you can increase or decrease the number of replicas at any time after index creation. So it is better to choose the number of required shards for an index at the time of index creation.
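For example, a request along the following lines creates an index with three primary shards and two replicas per shard (the index name my_index and the shard counts here are just an illustration):
curl -XPUT 'localhost:9200/my_index/' -d '{ "settings": { "index.number_of_shards": 3, "index.number_of_replicas": 2 } }'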
Creating an index
Let's begin by creating our first index and giving it a name, which is books in this case. After executing the following command, an index with five shards and one replica will be created:
curl -XPUT 'localhost:9200/books/'
Tip
Uppercase letters and blank spaces are not allowed in index names.
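For instance, an index creation request like the following one, whose name contains an uppercase letter (the name Books is used here purely for illustration), will be rejected by Elasticsearch with an error:
curl -XPUT 'localhost:9200/Books/'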
Indexing a document in Elasticsearch
Similar to all databases, Elasticsearch has the concept of a unique identifier for each document, which is known as _id. This identifier can be created in two ways: either you provide your own unique ID while indexing the data, or, if you don't provide any ID, Elasticsearch creates a default ID for that document. The following are some examples:
curl -XPUT 'localhost:9200/books/elasticsearch/1' -d '{ "name":"Elasticsearch Essentials", "author":"Bharvi Dixit", "tags":["Data Analytics","Text Search","Elasticsearch"], "content":"Added with PUT request" }'
On executing the above command, Elasticsearch will give the following response:
{"_index":"books","_type":"elasticsearch","_id":"1","_version":1,"created":true}
However, if you do not provide an ID (which is 1 in our case), then you will get the following error:
No handler found for uri [/books/elasticsearch] and method [PUT]
The reason behind the preceding error is that we are using a PUT request to create a document. However, Elasticsearch has no idea where to store this document (no existing URI for the document is available).
If you want the _id to be auto-generated, you have to use a POST request. For example:
curl -XPOST 'localhost:9200/books/elasticsearch' -d '{ "name":"Elasticsearch Essentials", "author":"Bharvi Dixit", "tags":["Data Analytics","Text Search","Elasticsearch"], "content":"Added with POST request" }'
The response from the preceding request will be as follows:
{"_index":"books","_type":"elasticsearch","_id":"AU-ityC8xdEEi6V7cMV5","_version":1,"created":true}
If you open the localhost:9200/_plugin/head URL, you can perform all the CRUD operations using the HEAD plugin as well:
Some of the stats that you can see in the preceding image are these:
- Cluster name: elasticsearch_cluster
- Node name: node-1
- Index name: books
- No. of primary shards: 5
- No. of docs in the index: 2
- No. of unassigned shards (replica shards): 5
Note
Cluster states in Elasticsearch
An Elasticsearch cluster can be in one of the three states: GREEN, YELLOW, or RED. If all the shards, meaning primary as well as replicas, are assigned in the cluster, it will be in the GREEN state. If any one of the replica shards is not assigned because of any problem, then the cluster will be in the YELLOW state. If any one of the primary shards is not assigned on a node, then the cluster will be in the RED state. We will see more on these states in the upcoming chapters. Elasticsearch never assigns a primary and its replica shard on the same node.
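You can check the current state of your cluster at any time with the cluster health API, for example:
curl -XGET 'localhost:9200/_cluster/health?pretty'
The status field in the response will be green, yellow, or red, depending on the shard allocation described above.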
Fetching documents
We have stored documents in Elasticsearch. Now we can fetch them using their unique IDs with a simple GET request.
Getting a complete document
We have already indexed our document. Now, we can get the document using its document identifier by executing the following command:
curl -XGET 'localhost:9200/books/elasticsearch/1?pretty'
The output of the preceding command is as follows:
{ "_index" : "books", "_type" : "elasticsearch", "_id" : "1", "_version" : 1, "found" : true, "_source":{"name":"Elasticsearch Essentials","author":"Bharvi Dixit", "tags":["Data Anlytics","Text Search","ELasticsearch"],"content":"Added with PUT request"} }
Note
pretty is used in the preceding request to make the response nicer and more readable.
As you can see, there is a _source field in the response. This is a special field reserved by Elasticsearch to store all the JSON data. There are options available to not store the data in this field, since it comes with an extra disk space requirement. However, this field also helps in many ways while returning data from ES, re-indexing data, or doing partial document updates. We will see more on this field in the next chapters.
If the document did not exist in the index, the found field would have been marked as false.
Getting part of a document
Sometimes you need only some of the fields to be returned instead of returning the complete document. For these scenarios, you can send the names of the fields to be returned inside the _source parameter with the GET request:
curl -XGET 'localhost:9200/books/elasticsearch/1?_source=name,author'
The response of Elasticsearch will be as follows:
{ "_index":"books", "_type":"elasticsearch", "_id":"1", "_version":1, "found":true, "_source":{"author":"Bharvi Dixit","name":"Elasticsearch Essentials"} }
Updating documents
It is possible to update documents in Elasticsearch, which can be done either completely or partially, but updates come with some limitations and costs. In the next sections, we will see how these operations can be performed and how things work behind the scenes.
Updating a whole document
To update a whole document, you can use the same kind of PUT/POST request that we used to create a new document:
curl -XPUT 'localhost:9200/books/elasticsearch/1' -d '{ "name":"Elasticsearch Essentials", "author":"Bharvi Dixit", "tags":["Data Analytics","Text Search","Elasticsearch"], "content":"Updated document", "publisher":"pact-pub" }'
The response of Elasticsearch looks like this:
{"_index":"books","_type":"elasticsearch","_id":"1","_version":2,"created":false}
If you look at the response, it shows that _version is 2 and created is false, meaning the document was updated.
Updating documents partially
Instead of updating the whole document, we can use the _update API to do partial updates. As shown in the following example, we will add a new field, updated_time, to the document, for which a script parameter has been used. Elasticsearch uses Groovy scripting by default.
Note
Scripting is disabled by default in Elasticsearch, so to use a script you need to enable it by adding the following parameter to your elasticsearch.yml file:
script.inline: on
curl -XPOST 'localhost:9200/books/elasticsearch/1/_update' -d '{ "script" : "ctx._source.updated_time= \"2015-09-09T00:00:00\"" }'
The response of the preceding request will be this:
{"_index":"books","_type":"elasticsearch","_id":"1","_version":3}
It shows that a new version has been created in Elasticsearch.
Elasticsearch stores data in indexes that are composed of Lucene segments. These segments are immutable in nature, meaning that, once created, they can't be changed. So, when we send an update request to Elasticsearch, it does the following things in the background:
- Fetches the JSON data from the _source field for that document
- Makes the changes in the _source field
- Deletes the old document
- Creates a new document
All these data re-indexing tasks can be done by the user; however, if you use the _update API, it is all done with a single request. The process is the same for a whole document update as for a partial update. The benefit of a partial update is that all operations are done within a single shard, which avoids network overhead.
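Note that for simple changes such as adding or overwriting a field, the _update API also accepts a doc parameter, which avoids scripting altogether. A minimal sketch of the same update without a script looks like this:
curl -XPOST 'localhost:9200/books/elasticsearch/1/_update' -d '{ "doc": { "updated_time": "2015-09-09T00:00:00" } }'
The fields provided under doc are merged into the existing _source of the document.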
Deleting documents
To delete a document using its identifier, we need to use the DELETE request:
curl -XDELETE 'localhost:9200/books/elasticsearch/1'
The following is the response of Elasticsearch:
{"found":true,"_index":"books","_type":"elasticsearch","_id":"1","_version":4}
If you are from a Lucene background, then you must know how segment merging is done and how new segments are created in the background as more documents get indexed. Whenever we delete a document from Elasticsearch, it does not get deleted from the file system right away. Rather, Elasticsearch just marks the document as deleted, and when segment merging happens during further indexing, the documents marked as deleted are actually removed according to the merge policy. The same process applies when a document is updated.
The space from deleted documents can also be reclaimed with the _optimize API by executing the following command:
curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes=true'
Checking documents' existence
While developing applications, some scenarios require you to check whether a document exists or not in Elasticsearch. In these scenarios, rather than querying the documents with a GET request, you have the option of using another HTTP request method called HEAD:
curl -i -XHEAD 'localhost:9200/books/elasticsearch/1'
The following is the response of the preceding command:
HTTP/1.1 200 OK Content-Type: text/plain; charset=UTF-8 Content-Length: 0
In the preceding command, I have used the -i parameter to show the header information of the HTTP response. It has been used because a HEAD request only returns headers and not any content. If the document is found, the status code will be 200, and if not, it will be 404.
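For example, issuing the same HEAD request against a document ID that does not exist in the index (the ID 99 below is purely illustrative) returns a 404 status in the response headers:
curl -i -XHEAD 'localhost:9200/books/elasticsearch/99'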