You're reading from Learning Elasticsearch Structured and unstructured data using distributed real-time search and analytics

Product type Paperback

Published in Jun 2017

Publisher Packt

ISBN-13 9781787128453

Length 404 pages

Edition 1st Edition

Tools

Elasticsearch

Concepts

Enterprise Search

Author (1):

Abhishek Andhavarapu

Scalability and availability

Let's say you want to index a billion documents; doing so on a single machine would be very challenging. Partitioning data across multiple machines allows Elasticsearch to scale beyond what a single machine can do and to support high-throughput operations. Your data is split into small parts called shards. When you create an index, you tell Elasticsearch the number of shards you want for the index, and Elasticsearch handles the rest for you. As your data grows, you can scale horizontally by adding more machines. We will go into more detail in the sections below.

There are two types of shards in Elasticsearch: primary and replica. The data you index is written to both primary and replica shards. A replica is an exact copy of its primary. If the node containing the primary shard goes down, the replica takes over. This process is completely transparent and managed by Elasticsearch. We will discuss this in detail in the Failure handling section below. Since a primary and its replicas are exact copies, a search query can be answered by either the primary or a replica shard. This significantly increases the number of simultaneous requests Elasticsearch can handle at any point in time.

As the index is distributed across multiple shards, a query against an index is executed in parallel across all the shards. The results from each shard are then gathered and sent back to the client. Executing the query in parallel greatly improves the search performance.

In the next section, we will discuss the relation between node, index and shard.

Relation between node, index, and shard

Shards are often the most confusing topic when I talk about Elasticsearch at conferences or to someone who has never worked with Elasticsearch. In this section, I want to focus on the relation between node, index, and shard. We will use a cluster with three nodes, create the same index with different shard configurations, and talk through the differences.

Three shards with zero replicas

We will start with an index called esintroduction with three shards and zero replicas. The distribution of the shards in a three node cluster is as follows:

In the above screenshot, shards are represented by the green squares. We will talk about replicas towards the end of this discussion. Since we have three nodes (servers) and three shards, the shards are evenly distributed across all three nodes, so each node contains one shard. As you index your documents into the esintroduction index, data is spread across the three shards.
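As a rough illustration of how the shard and replica counts are passed at index creation time, the following sketch builds the JSON body for a `PUT /esintroduction` request. The `number_of_shards` and `number_of_replicas` settings are standard index settings; the helper function `index_settings` is a hypothetical name introduced only for this example, and no request is actually sent.

```python
import json


def index_settings(number_of_shards, number_of_replicas):
    """Build the request body for creating an index (hypothetical helper)."""
    if number_of_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if number_of_replicas < 0:
        raise ValueError("the replica count cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


# Three primary shards and zero replicas, as in the configuration above.
body = index_settings(number_of_shards=3, number_of_replicas=0)
print("PUT /esintroduction")
print(json.dumps(body, indent=2))
```

The printed body could be sent with any HTTP client or pasted into a console; the other configurations in this section differ only in the two numbers.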

Six shards with zero replicas

Now, let's recreate the same esintroduction index with six shards and zero replicas. Since we have three nodes (servers) and six shards, each node will now contain two shards. The esintroduction index is split between six shards across three nodes.

The distribution of shards for an index with six shards is as follows:

The esintroduction index is spread across three nodes, meaning these three nodes will handle the index/query requests for the index. If these three nodes are not able to keep up with the indexing/search load, we can scale the esintroduction index by adding more nodes. Since the index has six shards, you could add three more nodes, and Elasticsearch would automatically rearrange the shards so that each of the six nodes holds one shard. Index/query requests for the esintroduction index would then be handled by six nodes instead of three. If this is not clear, do not worry; we will discuss this further as we progress through the book.
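The effect of adding nodes can be sketched with a toy model. The `distribute` function below is a hypothetical round-robin assignment, not Elasticsearch's actual allocation algorithm (which weighs more factors); it only shows how six shards spread over three nodes and then over six.

```python
def distribute(num_shards, nodes):
    """Assign shard numbers to nodes round-robin (a deliberate simplification)."""
    layout = {node: [] for node in nodes}
    for shard in range(num_shards):
        layout[nodes[shard % len(nodes)]].append(shard)
    return layout


# Node names are made up for the example.
three_nodes = distribute(6, ["node-1", "node-2", "node-3"])
six_nodes = distribute(6, ["node-%d" % i for i in range(1, 7)])

for name, layout in (("3 nodes", three_nodes), ("6 nodes", six_nodes)):
    print(name)
    for node, shards in layout.items():
        print("  %s holds shards %s" % (node, shards))
```

With three nodes each node holds two shards; with six nodes each holds one, so the same index is served by twice as many machines.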

Six shards with one replica

Let's now recreate the same esintroduction index with six shards and one replica, meaning the index will have six primary shards and six replica shards, a total of twelve shards. Since we have three nodes (servers) and twelve shards, each node will now contain four shards. The esintroduction index is still split into six shards across three nodes, but every shard now has a second copy. The green squares represent shards in the following figure.

The solid border represents primary shards, and replicas are the dotted squares:

As we discussed before, the index is distributed into multiple shards across multiple nodes. In a distributed environment, a node/server can go down for various reasons, such as disk failure, network issues, and so on. To ensure availability, each shard is, by default, replicated to a node other than the one where the primary shard exists. If the node containing the primary shard goes down, the shard replica is promoted to primary, so no data is lost and you can continue to operate on the index. In the preceding figure, the esintroduction index has six shards split across the three nodes. The primary of shard 2 belongs to node elasticsearch 1, and the replica of shard 2 belongs to node elasticsearch 3. If the elasticsearch 1 node goes down, the replica in elasticsearch 3 is promoted to primary. This switch is completely transparent and handled by Elasticsearch.
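The arithmetic and the placement rule above can be checked with a small sketch. The `allocate` function is a hypothetical toy allocator: it places each copy of a shard on a different node by stepping through the node list, which is enough to show the twelve-shard layout but is not how Elasticsearch actually decides placement. The node names are assumptions chosen to resemble those in the figure.

```python
def allocate(num_primaries, num_replicas, nodes):
    """Place every shard copy on a distinct node (toy allocator, an assumption)."""
    layout = {node: [] for node in nodes}
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            if copy >= len(nodes):
                break  # not enough nodes: this copy would stay unassigned
            node = nodes[(shard + copy) % len(nodes)]
            role = "P" if copy == 0 else "R"  # copy 0 is the primary
            layout[node].append("%s%d" % (role, shard))
    return layout


nodes = ["elasticsearch 1", "elasticsearch 2", "elasticsearch 3"]
layout = allocate(num_primaries=6, num_replicas=1, nodes=nodes)
total_shards = sum(len(copies) for copies in layout.values())

for node, copies in layout.items():
    print(node, copies)
print("total shards:", total_shards)
```

Each node ends up with four shard copies, twelve in total, and no node holds both the primary and the replica of the same shard. The exact shard-to-node mapping printed here will not match the figure; only the counts and the placement rule are meant to carry over.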

Distributed search

One of the reasons queries executed on Elasticsearch are so fast is that they are distributed. Multiple shards act as one index. A search query on an index is executed in parallel across all the shards.

Let's take an example. In the following figure, we have a cluster with two nodes (Node1 and Node2) and an index named chapter1 with two shards (S0 and S1) and one replica:

Assuming the chapter1 index has 100 documents, S0 and S1 would each have roughly 50 documents. Suppose you want to query for all the documents that contain the word Elasticsearch. The query is executed on S0 and S1 in parallel. The results are gathered from both shards and sent back to the client. Now imagine that you have to query across millions of documents; with Elasticsearch, the search can be distributed. For the application I'm currently working on, a query on more than 100 million documents comes back within 50 milliseconds, which would simply not be possible if the search were not distributed.
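The scatter-gather pattern described above can be sketched in plain Python. This is a toy model under stated assumptions: shards are in-memory lists, documents are routed with `zlib.crc32` as a stand-in for the hash Elasticsearch applies to the routing value, and matching is a naive word lookup instead of an inverted index. All function names and the sample documents are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 2  # S0 and S1, as in the chapter1 example


def shard_for(doc_id):
    """Pick a shard from the document ID (crc32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def build_shards(docs):
    """Route every document to exactly one shard."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id)].append((doc_id, text))
    return shards


def search_shard(shard, term):
    """Return the IDs of documents on one shard that contain the term."""
    term = term.lower()
    return [doc_id for doc_id, text in shard if term in text.lower().split()]


def search(shards, term):
    """Query all shards in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(search_shard, shards, [term] * len(shards)))
    return sorted(doc_id for partial in partials for doc_id in partial)


# 100 made-up documents; every fourth one mentions Elasticsearch.
docs = {
    "doc-%d" % i: "Elasticsearch is distributed" if i % 4 == 0 else "plain text"
    for i in range(100)
}
shards = build_shards(docs)
hits = search(shards, "Elasticsearch")
print("documents per shard:", [len(shard) for shard in shards])
print("matching documents:", len(hits))
```

Each shard only scans its own share of the documents, and the coordinating step merges the partial hit lists, which is the essence of why adding shards (and nodes to host them) lets a query cover more data in the same time.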

Failure handling

Elasticsearch handles failures automatically. This section describes how failures are handled internally. Let’s say we have an index with two shards and one replica. In the following diagram, the shards drawn with a solid line are primary shards, and the shards drawn with a dotted line are replicas:

As shown in the preceding diagram, we initially have a cluster with two nodes. Since the index has two shards and one replica, the shards are distributed across the two nodes. To ensure availability, a primary shard and its replica never exist on the same node; if they did and that node went down, the data could not be recovered. In the preceding diagram, you can see that the primary shard S0 belongs to Node1 and the replica shard S0 to Node2.

Next, just as we discussed in the Relation between node, index, and shard section, we will add two new nodes to the existing cluster, as shown here:

The cluster now contains four nodes, and the shards are automatically allocated to the new nodes. Each node in the cluster will now contain either a primary or a replica shard. Now, let's say Node2, which contains the primary shard S1, goes down as shown here:

Since the node that holds the primary shard went down, the replica of S1, which lives on Node3, is promoted to primary. To maintain the replication factor of 1, a new copy of the shard S1 is made on Node1. This process is known as rebalancing of the cluster.
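The promotion and re-replication steps can be sketched as a small simulation. The starting layout below is an assumption reconstructed from the text (the diagrams are not reproduced here), and `handle_node_failure` is a hypothetical function that picks the least-loaded node without a copy of the shard; Elasticsearch's real allocation logic is more involved.

```python
def handle_node_failure(cluster, failed_node):
    """Return the cluster layout after one node is lost (toy model)."""
    survivors = {
        node: dict(shards) for node, shards in cluster.items() if node != failed_node
    }
    for shard, role in cluster[failed_node].items():
        if role == "primary":
            # Promote a surviving replica of this shard to primary.
            promoted = False
            for shards in survivors.values():
                if shards.get(shard) == "replica":
                    shards[shard] = "primary"
                    promoted = True
                    break
            if not promoted:
                continue  # no copy survived, so this shard's data is lost
        # Restore the replica count on a node that has no copy of the shard,
        # preferring the node with the fewest shards (ties broken by name).
        candidates = sorted(
            (node for node, shards in survivors.items() if shard not in shards),
            key=lambda node: (len(survivors[node]), node),
        )
        if candidates:
            survivors[candidates[0]][shard] = "replica"
    return survivors


# Assumed four-node layout: one shard copy per node.
cluster = {
    "Node1": {"S0": "primary"},
    "Node2": {"S1": "primary"},
    "Node3": {"S1": "replica"},
    "Node4": {"S0": "replica"},
}
after = handle_node_failure(cluster, "Node2")
for node, shards in sorted(after.items()):
    print(node, shards)
```

Under these assumptions, the S1 replica on Node3 becomes the primary and a fresh S1 replica lands on Node1, matching the sequence described above.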

Depending on the application, the number of shards can be configured while creating the index. The process of rebalancing the shards to other nodes is entirely transparent to the user and handled automatically by Elasticsearch.

Strengths and limitations of Elasticsearch

The strengths of Elasticsearch are as follows:

Very flexible Query API:
- It supports a JSON-based REST API.
- Clients are available for all major languages, such as Java, Python, PHP, and so on.
- It supports filtering, sorting, pagination, and aggregations in the same query.
Supports auto/dynamic mapping:
- In the traditional SQL world, you must predefine the table schema before you can add data. Elasticsearch handles unstructured data automatically, meaning you can index JSON documents without predefining the schema. It will try to figure out the field mappings automatically.
- Adding new fields or removing existing fields is also handled automatically.
Highly scalable:
- Clustering, replication of data, and automatic failover are supported out of the box and are completely transparent to the user. For more details, refer to the Scalability and availability section.
Multi-language support:
- We discussed how stemming works and why it is important to reduce the different forms of a word to a common root. This process differs from language to language. Elasticsearch supports many languages out of the box.
Aggregations:
- Aggregations are one of the reasons why Elasticsearch is unlike anything else out there.
- It comes with a very powerful analytics engine, which can help you slice and dice your data.
- It supports nested aggregations. For example, you can group users first by the city they live in and then by their gender and then calculate the average age of each bucket.
Performance:
- Due to the inverted index and its distributed nature, Elasticsearch performs extremely well. The queries you traditionally run using a batch processing engine, such as Hadoop, can now be executed in real time.
Intelligent filter caching:
- The most recently used queries are cached. When the data is modified, the cache is invalidated automatically.
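To make the nested aggregation example from the list above concrete, the following sketch builds a search body that groups users by city, then by gender, and averages the age in each bucket, using the `terms` and `avg` aggregations. The field names `city`, `gender`, and `age`, the aggregation names, and the sample records are all assumptions made for illustration; the body also assumes the grouping fields are mapped so that they can be aggregated on. The second function computes the same grouping locally so the bucket structure is easy to inspect.

```python
import json
from collections import defaultdict


def nested_aggregation_body():
    """Search body: group by city, then gender, then average the age."""
    return {
        "size": 0,  # return only aggregation results, no matching documents
        "aggs": {
            "by_city": {
                "terms": {"field": "city"},
                "aggs": {
                    "by_gender": {
                        "terms": {"field": "gender"},
                        "aggs": {"average_age": {"avg": {"field": "age"}}},
                    }
                },
            }
        },
    }


def average_age_by_city_and_gender(users):
    """Local equivalent of the aggregation, for illustration only."""
    ages = defaultdict(list)
    for user in users:
        ages[(user["city"], user["gender"])].append(user["age"])
    return {key: sum(values) / len(values) for key, values in ages.items()}


# Made-up sample records.
users = [
    {"city": "Austin", "gender": "F", "age": 30},
    {"city": "Austin", "gender": "F", "age": 40},
    {"city": "Austin", "gender": "M", "age": 25},
    {"city": "Boston", "gender": "M", "age": 50},
]
buckets = average_age_by_city_and_gender(users)
print(json.dumps(nested_aggregation_body(), indent=2))
print(buckets)
```

Each inner `aggs` block runs once per bucket of its parent, which is what lets a single request replace several rounds of grouping on the client side.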

The limitations of Elasticsearch are as follows:

Not real time - eventual consistency (near real time):
- The data you index becomes available for search only after a refresh. By default, a process known as refresh runs every second and makes newly indexed data searchable.
Doesn't support SQL-like joins, but provides parent-child and nested documents to handle relations.
Doesn't support transactions and rollbacks: Transactions in a distributed system are expensive. It offers version-based control to make sure the update is happening on the latest version of the document.
Updates are expensive. An update on the existing document deletes the document and re-inserts it as a new document.
Elasticsearch might lose data due to the following reasons:
- Network partitions.
- Multiple nodes going down at the same time.
Elasticsearch has come a long way in improving resiliency. The current status can be tracked at https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html.
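The version-based control mentioned in the limitations above is a form of optimistic concurrency control. The sketch below is a toy in-memory model, not Elasticsearch code: `TinyStore` and `VersionConflict` are hypothetical names, and the store only mimics the idea that a write is rejected when the caller's expected version is stale, and that an update replaces the whole stored document.

```python
class VersionConflict(Exception):
    """Raised when a write is based on an outdated version."""


class TinyStore:
    """Toy document store with per-document version numbers."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                "expected version %d, found %d" % (expected_version, current_version)
            )
        # The old document is replaced wholesale rather than edited in place.
        self._docs[doc_id] = (current_version + 1, dict(source))
        return current_version + 1

    def get(self, doc_id):
        return self._docs[doc_id]


store = TinyStore()
store.index("1", {"stock": 10})

# Two writers read the same version, then both try to update.
version_seen, _ = store.get("1")
new_version = store.index("1", {"stock": 9}, expected_version=version_seen)
try:
    store.index("1", {"stock": 8}, expected_version=version_seen)
    conflict = False
except VersionConflict:
    conflict = True

print("new version:", new_version, "second writer rejected:", conflict)
```

The second writer has to re-read the document and retry, which gives a lightweight safeguard against lost updates without the cost of distributed transactions.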