You're reading from Learning Elasticsearch Structured and unstructured data using distributed real-time search and analytics

Product type Paperback

Published in Jun 2017

Publisher Packt

ISBN-13 9781787128453

Length 404 pages

Edition 1st Edition

Tools

Elasticsearch

Concepts

Enterprise Search

Author (1):

Abhishek Andhavarapu

View More author details

Basic concepts of Elasticsearch

Elasticsearch is a highly scalable open source search engine. Although it started as a text search engine, it is evolving as an analytical engine, which can support not only search but also complex aggregations. Its distributed nature and ease of use makes it very easy to get started and scale as you have more data.

One might ask what makes Elasticsearch different from any other document stores out there. Elasticsearch is a search engine and not just a key-value store. It's also a very powerful analytical engine; all the queries that you would usually run in a batch or offline mode can be executed in real time. Support for features such as autocomplete, geo-location based filters, multilevel aggregations, coupled with its user friendliness resulted in industry-wide acceptance. That being said, I always believe it is important to have the right tool for the right job. Towards the end of the chapter, we will discuss it’s strengths and limitations.

In this section, we will go through the basic concepts and terminology of Elasticsearch. We will start by explaining how to insert, update, and perform a search. If you are familiar with SQL language, the following table shows the equivalent terms in Elasticsearch:

Database	Table	Row	Column
Index	Type	Document	Field

Document

Your data in Elasticsearch is stored as JSON (Javascript Object Notation) documents. Most NoSQL data stores use JSON to store their data as JSON format is very concise, flexible, and readily understood by humans. A document in Elasticsearch is very similar to a row when compared to a relational database. Let's say we have a User table with the following information:

Id	Name	Age	Gender	Email
1	Luke	100	M	luke@gmail.com
2	Leia	100	F	leia@gmail.com

The users in the preceding user table, when represented in JSON format, will look like the following:

{
   "id": 1,
   "name": "Luke",
   "age": 100,
   "gender": "M",
   "email": "luke@gmail.com"
 }

{
   "id": 2,
   "name": "Leia",
   "age": 100,
   "gender": "F",
   "email": "leia@gmail.com"
 }

A row contains columns; similarly, a document contains fields. Elasticsearch documents are very flexible and support storing nested objects. For example, an existing user document can be easily extended to include the address information. To capture similar information using a table structure, you need to create a new address table and manage the relations using a foreign key. The user document with the address is shown here:

{
   "id": 1,
   "name": "Luke",
   "age": 100,
   "gender": "M",
   "email": "luke@gmail.com",
   "address": {
     "street": "123 High Lane",
     "city": "Big City",
     "state": "Small State",
     "zip": 12345
   }
 }

Reading similar information without the JSON structure would also be difficult as the information would have to be read from multiple tables. Elasticsearch allows you to store the entire JSON as it is. For a database table, the schema has to be defined before you can use the table. Elasticsearch is built to handle unstructured data and can automatically determine the data types for the fields in the document. You can index new documents or add new fields without adding or changing the schema. This process is also known as dynamic mapping. We will discuss how this works and how to define schema in Chapter 3, Modeling Your Data and Document Relations.

Index

An index is similar to a database. The term index should not be confused with a database index, as someone familiar with traditional SQL might assume. Your data is stored in one or more indexes just like you would store it in one or more databases. The word indexing means inserting/updating the documents into an Elasticsearch index. The name of the index must be unique and typed in all lowercase letters. For example, in an e-commerce world, you would have an index for the items--one for orders, one for customer information, and so on.

Type

A type is similar to a database table, an index can have one or more types. Type is a logical separation of different kinds of data. For example, if you are building a blog application, you would have a type defined for articles in the blog and a type defined for comments in the blog. Let's say we have two types--articles and comments.

The following is the document that belongs to the article type:

{
   "articleid": 1,
   "name": "Introduction to Elasticsearch"
 }

The following is the document that belongs to the comment type:

{
   "commentid": "AVmKvtPwWuEuqke_aRsm",
   "articleid": 1,
   "comment": "Its Awesome !!"
 }

We can also define relations between different types. For example, a parent/child relation can be defined between articles and comments. An article (parent) can have one or more comments (children). We will discuss relations further in Chapter 3, Modeling Your Data and Document Relations.

Cluster and node

In a traditional database system, we usually have only one server serving all the requests. Elasticsearch is a distributed system, meaning it is made up of one or more nodes (servers) that act as a single application, which enables it to scale and handle load beyond what a single server can handle. Each node (server) has part of the data. You can start running Elasticsearch with just one node and add more nodes, or, in other words, scale the cluster when you have more data. A cluster with three nodes is shown in the following diagram:

In the preceding diagram, the cluster has three nodes with the names elasticsearch1, elasticsearch2, elasticsearch3. These three nodes work together to handle all the indexing and query requests on the data. Each cluster is identified by a unique name, which defaults to elasticsearch. It is often common to have multiple clusters, one for each environment, such as staging, pre-production, production.

Just like a cluster, each node is identified by a unique name. Elasticsearch will automatically assign a unique name to each node if the name is not specified in the configuration. Depending on your application needs, you can add and remove nodes (servers) on the fly. Adding and removing nodes is seamlessly handled by Elasticsearch.

We will discuss how to set up an Elasticsearch cluster in Chapter 2, Setting Up Elasticsearch and Kibana.

Shard

An index is a collection of one or more shards. All the data that belongs to an index is distributed across multiple shards. By spreading the data that belongs to an index to multiple shards, Elasticsearch can store information beyond what a single server can store. Elasticsearch uses Apache Lucene internally to index and query the data. A shard is nothing but an Apache Lucene instance. We will discuss Apache Lucene and why Elasticsearch uses Lucene in the How search works section later.

I know we introduced a lot of new terms in this section. For now, just remember that all data that belongs to an index is spread across one or more shards. We will discuss how shards work in the Scalability and Availability section towards the end of this chapter.

You're reading from Learning Elasticsearch Structured and unstructured data using distributed real-time search and analytics

Table of Contents (11) Chapters

Basic concepts of Elasticsearch

Document

Index

Type

Cluster and node

Shard

Authors (1)

Other recommended products

Personalised recommendations for you