Understanding Elasticsearch
Elasticsearch is a search server built on top of Lucene (licensed under Apache) and written entirely in Java. It supports distributed search in a multitenant environment and scales out easily as machines are added. It combines a full-text search engine with a RESTful web interface and JSON documents. Elasticsearch builds on the Lucene Java libraries, adding APIs, scalability, and flexibility on top of the Lucene full-text search library. The underlying operations, such as searching text, matching text, and creating indexes, are implemented by Apache Lucene.
Note
Unless Shield or another proxy mechanism is set up, any user with access to the Elasticsearch API can view all the data stored in the cluster.
The basic concepts of Elasticsearch
Let's explore some of the basic concepts of Elasticsearch:
- Field: This is the smallest single unit of data stored in Elasticsearch. It is similar to a column in a traditional relational database. Every document contains key-value pairs, which are referred to as fields. A field can hold a single value, such as an integer [27] or a string ["Kibana"], or multiple values, such as an array [1, 2, 3, 4, 5]. The field type specifies which kind of data can be stored in a particular field, for example, integer, string, date, and so on.
- Document: This is the simplest unit of information stored in Elasticsearch. It is a collection of fields. It is considered similar to a row of a table in a traditional relational database. A document can represent any type of entry, such as a document for a single restaurant, another document for a single cuisine, and yet another for a single order. Documents are written in JavaScript Object Notation (JSON), which is a language-independent data interchange format made up of key-value pairs. Every document that is stored in Elasticsearch is indexed. Every document has a type and an ID. An example of a document that has JSON values is as follows:
{
  "name": "Yuvraj",
  "age": 22,
  "birthdate": "2015-07-27",
  "bank_balance": 10500.50,
  "interests": ["playing games", "movies", "travelling"],
  "movie": {"name": "Titanic", "genre": "Romance", "year": 1997}
}
In the preceding example, the document consists of JSON key-value pairs, which are explained as follows:
  - The name field is of the string type
  - The age field is of the numeric type
  - The birthdate field is of the date type
  - The bank_balance field is of the float type
  - The interests field contains an array
  - The movie field contains an object (dictionary)
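The field kinds listed above can be checked with a short Python sketch. It only illustrates how the JSON values in the example map onto the kinds of fields discussed here; it is not how Elasticsearch itself detects types, and the describe helper and its labels are hypothetical names chosen for this example.

```python
import json
from datetime import date

# The example document from the text, as a JSON string.
doc_json = """
{
  "name": "Yuvraj",
  "age": 22,
  "birthdate": "2015-07-27",
  "bank_balance": 10500.50,
  "interests": ["playing games", "movies", "travelling"],
  "movie": {"name": "Titanic", "genre": "Romance", "year": 1997}
}
"""

def describe(value):
    """Return a rough label for the kind of value a field holds (illustrative only)."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        # Assumption for this sketch: a string in YYYY-MM-DD form counts as a date.
        try:
            date.fromisoformat(value)
            return "date"
        except ValueError:
            return "string"
    return "null"

doc = json.loads(doc_json)
field_kinds = {name: describe(value) for name, value in doc.items()}

for name, kind in field_kinds.items():
    print(f"{name}: {kind}")
```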
- Type: This is similar to a table in a traditional relational database. It contains a list of fields, which is defined for every document. A type is a logical partition within an index, whose interpretation/semantics depends entirely on you. For example, suppose you have data about the world and you put all of it into one index. In this index, you can define a type for continent-wise data, another type for country-wise data, and a third type for region-wise data. Types are used with the mapping API, which specifies the type of each field. An example of type mapping is as follows:
{
  "user": {
    "properties": {
      "name": { "type": "string" },
      "age": { "type": "integer" },
      "birthdate": { "type": "date" },
      "bank_balance": { "type": "float" },
      "interests": { "type": "string" },
      "movie": {
        "properties": {
          "name": { "type": "string" },
          "genre": { "type": "string" },
          "year": { "type": "integer" }
        }
      }
    }
  }
}
Now, let's take a look at the core data types specified in Elasticsearch, as follows:

| Type | Definition |
|------|------------|
| string | This contains text, for example, "Kibana" |
| integer | This contains a 32-bit integer, for example, 7 |
| long | This contains a 64-bit integer |
| float | IEEE float, for example, 2.7 |
| double | This is a double-precision float |
| boolean | This can be true or false |
| date | This is the UTC date/time, for example, "2015-06-30T13:10:10" |
| geo_point | This is a latitude and longitude pair |

- Index: This is a collection of one or more documents. It is similar to a database in a traditional relational database system. For example, you can have one index for user information, another for transaction information, and another for product types. An index has a mapping, which is used to define multiple types. In other words, an index can contain a single type or multiple types. An index is identified by a name, which is used whenever you refer to the index to perform search, update, and delete operations on documents. You can define any number of indexes you require. Indexes also act as logical namespaces that map documents to primary shards, each of which can have zero or more replica shards for replicating data. With respect to traditional databases, the basic analogy is as follows:
MySQL => Databases => Tables => Columns/Rows
Elasticsearch => Indexes => Types => Documents with Fields
Note
You can store a single document or multiple documents within a type or index. As a document is within an index, it must also be assigned to a type within that index. Moreover, the maximum number of documents that a single Lucene index (that is, a single shard) can hold is 2,147,483,519 (roughly 2.1 billion), which is Integer.MAX_VALUE - 128.
- ID: This is an identifier for a document. It is used to identify each document. If it is not defined, it is autogenerated for every document.
Note
The combination of index, type, and ID must be unique for each document.
- Mapping: Mappings are similar to schemas in a traditional relational database. Every document in an index has a type. A mapping defines the fields, the data type for each field, and how the field should be handled by Elasticsearch. By default, a mapping is automatically generated whenever a document is indexed. If the default settings are overridden, then the mapping's definition has to be provided explicitly.
- Node: This is a running instance of Elasticsearch. Each node is part of a cluster. On a standalone machine, each Elasticsearch server instance corresponds to a node. Multiple nodes can be started on a single machine and can belong to a single cluster. A node stores data and takes part in the indexing/searching capabilities of its cluster. By default, whenever a node is started, it is assigned a random Marvel Comics character name. You can change the configuration file to name nodes as per your requirement. A node also needs to be configured with the name of the cluster it should join. By default, all nodes use the cluster name elasticsearch; that is, if any number of nodes are started up on a network/machine without further configuration, they will automatically join the elasticsearch cluster.
- Cluster: This is a collection of one or more nodes that share a single cluster name. Each cluster automatically elects a master node; if the master node fails, another eligible node is elected as the new master, thus providing high availability. The cluster holds all of the stored data and provides a unified view for search across all nodes. By default, the cluster name is elasticsearch, and it is the parameter that identifies which cluster a node belongs to. All nodes, by default, join the elasticsearch cluster. When running a cluster in production, it is advisable to change the cluster name for ease of identification, but the default name can be used for other purposes, such as development or testing.
Note
The Elasticsearch cluster contains single or multiple indexes, which contain single or multiple types. All types contain single or multiple documents, and every document contains single or multiple fields.
- Sharding: This is an important concept for understanding how Elasticsearch scales across nodes when handling large amounts of data (big data). An index can grow to any size, but if it outgrows the disk of the node that hosts it, searches become slow or fail. For example, suppose the disk limit is 1 TB and an index contains so many documents that it does not fit within 1 TB on a single node. To counter such problems, Elasticsearch provides shards, which break the index into multiple pieces. Each shard acts as an independent index that is hosted on a node within the cluster. Elasticsearch is responsible for distributing shards among nodes. Sharding serves two purposes: it allows horizontal scaling of the content volume, and it improves performance by running operations in parallel across shards that are distributed on nodes (single or multiple, depending on the number of nodes running).
Note
Elasticsearch helps move shards among multiple nodes in the event of an addition of new nodes or a node failure.
There are two types of shards, as follows:
- Primary shard: Every document is stored within a primary shard. By default, every index has five primary shards. This parameter is configurable and can be changed to define more or fewer shards as per the requirement. The number of primary shards has to be defined when the index is created. If no parameters are defined, then five primary shards will automatically be created.
Note
Whenever a document is indexed, it is usually done on a primary shard initially, followed by replicas. The number of primary shards defined in an index cannot be altered once the index is created.
- Replica shard: Replica shards are an important feature of Elasticsearch. They help provide high availability across nodes in the cluster. By default, every primary shard has one replica shard. However, every primary shard can have zero or more replica shards as required. In an environment where failure directly affects the enterprise, it is highly recommended to use a system that provides a failover mechanism to achieve high availability. For this purpose, Elasticsearch creates one or more copies of an index's shards, and these are termed replica shards or replicas. A replica shard is a full copy of its primary shard. The number of replica shards can be altered dynamically. Replicas serve two purposes:
  - They provide high availability in the event of failure of a node or a primary shard. If a primary shard fails, a replica shard is automatically promoted to primary.
  - They increase performance by allowing search requests to be handled in parallel on replica shards.
Note
A replica shard is never kept on the same node as that of the primary shard from which it was copied.
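The following minimal Python sketch shows why the number of primary shards is fixed once an index exists. It assumes that a document is routed to a primary shard by hashing its ID and taking the remainder modulo the number of primary shards. The CRC32 hash used here is only a stand-in for whatever hash function Elasticsearch uses internally, and shard_for is a hypothetical helper name.

```python
import zlib

def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Pick a primary shard number for a document ID.

    Assumption: routing follows hash(id) % number_of_primary_shards.
    CRC32 is a stand-in hash used only for this illustration.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

doc_ids = [str(i) for i in range(1, 11)]

# Placement with the default of five primary shards.
with_five = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}

# Placement if the shard count were changed to six after indexing.
with_six = {doc_id: shard_for(doc_id, 6) for doc_id in doc_ids}

# Documents whose computed shard differs between the two layouts would no
# longer be found where they were originally stored.
moved = [doc_id for doc_id in doc_ids if with_five[doc_id] != with_six[doc_id]]

print("placement with 5 shards:", with_five)
print("IDs that would map to a different shard with 6 shards:", moved)
```

Because the shard number depends on the shard count, changing that count after documents have been indexed would send lookups to the wrong shard, which is one way to understand why the setting cannot be altered later.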
- Inverted index: This is also a very important concept in Elasticsearch. It is used to provide fast full-text search. Instead of scanning the text of every document, a search looks terms up in an index. The inverted index lists the unique words occurring in the documents, along with the list of documents in which each word occurs. For example, suppose we have three documents. They have a text field, and it contains the following:
- I am learning Kibana
- Kibana is an amazing product
- Kibana is easy to use
To create an inverted index, the text field is broken into words (also known as terms), a list of unique terms is created, and each term is listed against the documents in which it occurs, as shown in this table:

| Term | Doc 1 | Doc 2 | Doc 3 |
|------|-------|-------|-------|
| I | X | | |
| Am | X | | |
| Learning | X | | |
| Kibana | X | X | X |
| Is | | X | X |
| An | | X | |
| Amazing | | X | |
| Product | | X | |
| Easy | | | X |
| To | | | X |
| Use | | | X |

Now, if we search for is Kibana, Elasticsearch will use the inverted index to display the results:

| Term | Doc 1 | Doc 2 | Doc 3 |
|------|-------|-------|-------|
| Is | | X | X |
| Kibana | X | X | X |

With inverted indexes, Elasticsearch uses the functionality of Lucene to provide fast full-text search results.
Note
An inverted index uses an index based on keywords (terms) instead of a document-based index.
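The three example documents can be turned into an inverted index with a few lines of Python. This is a minimal sketch of the idea only: it splits on whitespace and lowercases terms, whereas Lucene's real analysis and storage are far more involved. The function names build_inverted_index and search are hypothetical.

```python
from collections import defaultdict

# The three example documents from the text, keyed by document number.
docs = {
    1: "I am learning Kibana",
    2: "Kibana is an amazing product",
    3: "Kibana is easy to use",
}

def build_inverted_index(documents):
    """Map each lowercased term to the set of document numbers containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Look up each query term and return the sorted documents it occurs in."""
    return {term: sorted(index.get(term, set())) for term in query.lower().split()}

index = build_inverted_index(docs)
result = search(index, "is Kibana")

for term, doc_ids in result.items():
    print(f"{term}: {doc_ids}")
```

The lookup touches only the entries for the two query terms, which is the reason a term-based index answers full-text queries faster than scanning every document.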
- REST API: REST stands for Representational State Transfer. It is a stateless client-server architectural style that uses HTTP requests to store, view, and delete data. It supports CRUD operations (short for Create, Read, Update, and Delete) using HTTP. The REST API is used to communicate with Elasticsearch and can be called from any language that can send HTTP requests. It communicates with Elasticsearch over port 9200 (by default), which is accessible from any web browser. Elasticsearch can also be communicated with directly from the command line using the curl command. cURL is a command-line tool used to send, view, or delete data using URL syntax, following the HTTP structure. A cURL request is similar to an HTTP request and has the following form:
curl -X <VERB> '<PROTOCOL>://<HOSTNAME>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'
The terms marked within the <> tags are variables, which are defined as follows:
  - VERB: This is used to provide an appropriate HTTP method, such as GET (to get data), POST, PUT (to store data), or DELETE (to delete data).
  - PROTOCOL: This is used to define whether the HTTP or HTTPS protocol is used to send requests.
  - HOSTNAME: This is used to define the hostname of a node present in the Elasticsearch cluster. By default, the hostname of Elasticsearch is localhost.
  - PORT: This is used to define the port on which Elasticsearch is running. By default, Elasticsearch runs on port 9200.
  - PATH: This is used to define the index, type, and ID where the documents will be stored, searched, or deleted. It is specified as index/type/ID.
  - QUERY_STRING: This is used to define any additional query parameter for searching data.
  - BODY: This is used to define a JSON-encoded request within the body.
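To make the roles of these variables concrete, here is a small Python sketch that assembles a request URL from the same parts using only the standard library. The build_url helper is a hypothetical name, and the q=name:Kibana query parameter is an assumed example value rather than something taken from the text.

```python
from urllib.parse import urlencode, urlunsplit

def build_url(protocol, hostname, port, path, query=None):
    """Assemble <PROTOCOL>://<HOSTNAME>:<PORT>/<PATH>?<QUERY_STRING>."""
    netloc = f"{hostname}:{port}"
    # urlencode percent-encodes reserved characters in the query values.
    query_string = urlencode(query or {})
    return urlunsplit((protocol, netloc, path, query_string, ""))

# PATH in index/type/ID form, with no query string.
document_url = build_url("http", "localhost", 9200, "/testing/test/1")

# A search path with an assumed example query parameter.
search_url = build_url(
    "http", "localhost", 9200, "/testing/_search", {"q": "name:Kibana"}
)

print(document_url)
print(search_url)
```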
In order to put data into Elasticsearch, the following curl command is used:
curl -XPUT 'http://localhost:9200/testing/test/1' -d '{"name": "Kibana" }'
Here, testing is the name of the index, test is the name of the type within the index, and 1 indicates the ID number.
To search for the preceding stored data, the following curl command is used:
curl -XGET 'http://localhost:9200/testing/_search?
Note
The preceding commands are provided just to give you an overview of the format of the curl command.
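The same kind of request can also be prepared from a program. The sketch below uses Python's standard urllib.request module to build, but not send, a request equivalent to the PUT command shown earlier. The build_request helper is a hypothetical name, and the Content-Type header is an assumption added for this example. Actually sending the request would need a running Elasticsearch node and a call such as urllib.request.urlopen.

```python
import json
from urllib.request import Request

def build_request(verb, url, body=None):
    """Prepare an HTTP request with an optional JSON-encoded body (not sent)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    # Assumption for this sketch: JSON bodies are labelled with a Content-Type header.
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(url, data=data, headers=headers, method=verb)

# Mirrors: curl -XPUT 'http://localhost:9200/testing/test/1' -d '{"name": "Kibana" }'
put_request = build_request(
    "PUT", "http://localhost:9200/testing/test/1", {"name": "Kibana"}
)

print(put_request.get_method(), put_request.full_url)
print(put_request.data.decode("utf-8"))
```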