You're reading from Learning Elastic Stack 6.0 A beginner's guide to distributed search, analytics, and visualization using Elasticsearch, Logstash and Kibana

Product type Paperback

Published in Dec 2017

Publisher Packt

ISBN-13 9781787281868

Length 434 pages

Edition 1st Edition

Authors (2):

Sharath Kumar

Pranav Shukla

What is Elasticsearch, and why use it?

Since you are reading this book, you probably already know what Elasticsearch is. For the sake of completeness, let us define Elasticsearch.

Elasticsearch is a realtime, distributed search and analytics engine that is horizontally scalable and capable of solving a wide variety of use cases. At the heart of Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.

Elasticsearch is at the core of Elastic Stack, playing the central role of a search and analytics engine. Elasticsearch is built on a radically different technology, Apache Lucene. This fundamentally different technology in Elasticsearch sets it apart from traditional relational databases and other NoSQL solutions. Let us look at the key benefits of using Elasticsearch as your data store:

Schemaless and document-oriented
Searching
Analytics
Rich client library support and the REST API
Easy to operate and easy to scale
Near real time
Lightning fast
Fault tolerant

Let us look at each benefit one by one.

Schemaless and document-oriented

Elasticsearch does not impose a strict structure on your data; you can store any JSON documents. JSON documents are first-class citizens in Elasticsearch, as opposed to rows and columns in a relational database. A document is roughly equivalent to a record in a relational database table. Traditional relational databases require a schema to be defined beforehand to specify a fixed set of columns and their datatypes and sizes. Often the nature of data is very dynamic, requiring support for new or dynamic columns. JSON documents naturally support this type of data. For example, take a look at the following document:

{
 "name": "John Smith",
 "address": "121 John Street, NY, 10010",
 "age": 40
}

This document may represent a customer's record. Here the record has the name, address, and age of the customer. Another record may look like the following one:

{
 "name": "John Doe",
 "age": 38,
 "email": "john.doe@company.org"
}

Note that the second customer doesn't have the address field, but instead has an email address. In fact, other customer documents may have completely different sets of fields. This provides a tremendous amount of flexibility in terms of what can be stored.
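
To make this flexibility concrete, the following Python sketch parses the two customer documents shown above and reports which fields they share and which appear in only one of them. It is only an illustration of documents with differing fields; the helper names `fields_by_document` and `shared_and_optional_fields` are hypothetical, and nothing here calls Elasticsearch.

```python
import json

# The two customer documents from the text, as raw JSON strings.
raw_documents = [
    '{"name": "John Smith", "address": "121 John Street, NY, 10010", "age": 40}',
    '{"name": "John Doe", "age": 38, "email": "john.doe@company.org"}',
]


def fields_by_document(raw_docs):
    """Parse each JSON document and return its field names, sorted."""
    return [sorted(json.loads(raw).keys()) for raw in raw_docs]


def shared_and_optional_fields(raw_docs):
    """Split field names into those present in every document and the rest."""
    field_sets = [set(json.loads(raw)) for raw in raw_docs]
    shared = set.intersection(*field_sets)
    optional = set.union(*field_sets) - shared
    return sorted(shared), sorted(optional)


shared, optional = shared_and_optional_fields(raw_documents)
```

Neither document had to declare its fields up front; the differences are discovered only when the documents are read.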

Searching

The core strength of Elasticsearch lies in its text processing capabilities. Elasticsearch is great at searching, especially a full-text search. Let us understand what a full-text search is.

Full-text search means searching through all the terms of all the documents available in the database. This requires the entire contents of all documents to be parsed and stored beforehand. When you hear full-text search, think of Google Search. You can enter any search term and Google looks through all of the web pages on the internet to find the best matching web pages. This is quite different from simple SQL queries run against columns of type string in relational databases. Normal SQL queries with a WHERE clause and an equals (=) or LIKE operator try to do an exact or wildcard match with the underlying data. SQL queries can, at best, just match the search term to a substring within the text column.
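
The limits of equality and LIKE matching can be seen in a small experiment. The sketch below uses Python's built-in sqlite3 module with an in-memory database; SQLite, the `notes` table, and its rows are assumptions chosen for illustration and do not come from the book.

```python
import sqlite3

# In-memory database with one text column; the table and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO notes (id, body) VALUES (?, ?)",
    [
        (1, "The cat sat on the mat"),
        (2, "Please concatenate the two strings"),
        (3, "cat"),
    ],
)


def matching_ids(where_clause, parameter):
    """Return the ids of rows that satisfy the given WHERE clause."""
    sql = "SELECT id FROM notes WHERE " + where_clause + " ORDER BY id"
    return [row[0] for row in conn.execute(sql, (parameter,))]


# Equality only matches a row whose whole column value equals the term.
exact = matching_ids("body = ?", "cat")

# LIKE with wildcards matches any substring, so "concatenate" also matches.
substring = matching_ids("body LIKE ?", "%cat%")
```

Here the equality query finds only the row whose entire value is "cat", while the LIKE query also returns the row about concatenation, because it matches characters rather than words.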

When you want to perform a search similar to Google search on your own data, Elasticsearch is your best bet. You can index emails, text documents, PDF files, web pages, or practically any unstructured text documents and search across all your documents with search terms.

At a high level, Elasticsearch breaks up text data into terms and makes every term searchable by building Lucene indexes. You can build your own Google-like search for your application which is very fast and flexible.
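
The idea of breaking text into terms and making every term searchable can be sketched with a toy inverted index in Python. This is a deliberately simplified illustration, not how Lucene is actually implemented; the function names, the tokenization rule, and the sample documents are all assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up sample documents keyed by id.
documents = {
    1: "The cat sat on the mat",
    2: "Please concatenate the two strings",
    3: "A cat and a dog",
}
index = build_inverted_index(documents)
cat_hits = search(index, "cat")
```

Because the lookup works on whole terms, a search for "cat" finds documents 1 and 3 but not document 2, where "cat" only occurs inside another word.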

In addition to supporting text data, Elasticsearch also supports other data types such as numbers, dates, geolocations, IP addresses, and many more. We will take an in-depth look at search in Chapter 3, Searching-What is Relevant.

Analytics

Apart from search, the second most important functional strength of Elasticsearch is analytics. Yes, what was originally known just as a full-text search engine is now used as an analytics engine in a variety of use cases. Many organizations are running analytics solutions powered by Elasticsearch in production.

Search is like zooming in and finding a needle in a haystack. Search helps zoom in on precisely what is needed in huge amounts of data. Analytics is exactly the opposite of search; it is about zooming out and taking a look at the bigger picture. For example, you may want to know how many visitors on your website are from the United States as opposed to every other country, or you may want to know how many of your website's visitors use macOS, Windows, or Linux.

Elasticsearch supports a wide variety of aggregations for analytics. Elasticsearch aggregations are quite powerful and can be applied to various datatypes. We will take a look at the analytics capabilities of Elasticsearch in Chapter 4, Analytics with Elasticsearch.
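
As a rough picture of what such a zoomed-out question computes, the Python sketch below groups a handful of made-up visit records by country and by operating system and counts each group. It runs entirely in memory and only mimics the idea of bucketing; the records, field names, and the `bucket_counts` helper are hypothetical and are not an Elasticsearch API.

```python
from collections import Counter

# Made-up page-view records; the field names are hypothetical.
visits = [
    {"country": "US", "os": "macOS"},
    {"country": "US", "os": "Windows"},
    {"country": "IN", "os": "Linux"},
    {"country": "DE", "os": "Windows"},
    {"country": "US", "os": "Linux"},
]


def bucket_counts(records, field):
    """Group records by the value of one field and count each group."""
    return Counter(record[field] for record in records if field in record)


by_country = bucket_counts(visits, "country")
by_os = bucket_counts(visits, "os")

# United States versus every other country, as in the example in the text.
us_visits = by_country["US"]
other_visits = sum(by_country.values()) - us_visits
```

No individual record is returned here; the result is a summary of the whole data set, which is the sense in which analytics zooms out.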

Rich client library support and the REST API

Elasticsearch has very rich client library support to make it accessible by many programming languages. There are client libraries available for Java, C#, Python, JavaScript, PHP, Perl, Ruby, and many more. Apart from the official client libraries, there are community driven libraries for 20 plus programming languages.

Additionally, it has a very rich REST (Representational State Transfer) API, which works over the HTTP protocol. The REST API is very well documented and quite comprehensive, making all operations available over HTTP.

All this means that Elasticsearch is very easy to integrate in any application to fulfill your search and analytics needs.
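
Because every operation is plain HTTP, even a standard library is enough to talk to Elasticsearch. The Python sketch below builds, but does not send, a search request using urllib. It assumes a node listening on localhost at the default port 9200 and a hypothetical index named `customers`; the helper `build_search_request` is likewise made up for illustration.

```python
import json
import urllib.request

# Assumptions: a local node on the default port 9200 and a hypothetical
# index named "customers".
BASE_URL = "http://localhost:9200"


def build_search_request(index_name, field, text):
    """Build (but do not send) an HTTP request carrying a match query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{BASE_URL}/{index_name}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("customers", "name", "john")

# Sending the request needs a running node, for example:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```

The same request could be issued from any language or tool that can send JSON over HTTP, which is what makes integration straightforward.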

Easy to operate and easy to scale

Elasticsearch can run on a single node and easily scale out to hundreds of nodes. It is very easy to start a single-node instance of Elasticsearch; it works out of the box without any configuration changes.

Horizontal scalability is the ability to scale a system horizontally by starting up multiple instances of the same type rather than making one instance more and more powerful. Vertical scaling is about upgrading a single instance by adding more processing power (by increasing the number of CPUs or CPU cores), memory, or storage capacity. There is a practical limit to how much a system can be scaled vertically due to cost and other factors, such as the availability of higher end hardware.

Unlike most traditional databases, which only allow vertical scaling, Elasticsearch can be scaled horizontally. It can run on tens or hundreds of commodity nodes instead of one extremely expensive server. Adding a node to an existing Elasticsearch cluster is as easy as starting up a new node in the same network, with virtually no extra configuration. The client application doesn't need to change, whether it is running against a single node or a hundred-node cluster.
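
One way to picture why the client does not need to change is to separate where a document belongs from which node currently holds it. The Python sketch below is a simplified model and not Elasticsearch's actual allocation logic: it assumes the data is split into a fixed number of partitions (Elasticsearch calls them shards), picks a partition with a CRC32 hash, and spreads partitions over nodes round-robin. All names and numbers are hypothetical.

```python
import zlib

PARTITION_COUNT = 6  # assumed to be fixed when the data set is created


def partition_for(doc_id):
    """Map a document id to a partition using a stable hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % PARTITION_COUNT


def place_partitions(node_names):
    """Spread the fixed set of partitions over the nodes, round-robin."""
    return {p: node_names[p % len(node_names)] for p in range(PARTITION_COUNT)}


def node_for(doc_id, placement):
    """Find the node that currently holds the document's partition."""
    return placement[partition_for(doc_id)]


# The same data set placed on one node and on three nodes.
single_node = place_partitions(["node-1"])
three_nodes = place_partitions(["node-1", "node-2", "node-3"])
```

In this model the partition of a document never depends on the number of nodes; only the placement table changes when nodes are added, so the caller's view of the data stays the same.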

Near real time

Data is available for querying typically within a second after it has been indexed (saved). Not all big data storage systems are real-time capable. Elasticsearch allows you to index thousands to hundreds of thousands of documents per second and makes them available for searching almost immediately.
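
The short delay between indexing and searching can be modelled with a toy class in Python. The sketch assumes that newly indexed documents sit in a buffer until a refresh step makes them searchable; in Elasticsearch a refresh runs periodically (every second by default), whereas here it is called by hand. The class `NearRealTimeIndex` and its methods are hypothetical and greatly simplified.

```python
class NearRealTimeIndex:
    """Toy model: writes are buffered and become searchable on refresh."""

    def __init__(self):
        self._buffer = []      # indexed but not yet searchable
        self._searchable = []  # visible to search

    def index(self, document):
        """Accept a document; it is not visible to search yet."""
        self._buffer.append(document)

    def refresh(self):
        """Make every buffered document visible to search."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, field, value):
        """Return searchable documents whose field equals the value."""
        return [doc for doc in self._searchable if doc.get(field) == value]


store = NearRealTimeIndex()
store.index({"name": "John Doe", "age": 38})
before_refresh = store.search("name", "John Doe")
store.refresh()
after_refresh = store.search("name", "John Doe")
```

The document is accepted immediately but only shows up in search results after the refresh, which is why the behaviour is called near real time rather than real time.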

Lightning fast

Elasticsearch uses Apache Lucene as its underlying technology. By default, Elasticsearch indexes all the fields of your documents. This is invaluable, as you can query or search by any field in your records. You will never be in a situation in which you think, "If only I had chosen to create an index on this field." Elasticsearch contributors have leveraged Apache Lucene to its best advantage, and there are other optimizations which make it lightning fast.

Fault tolerant

Elasticsearch clusters can keep running even when there are hardware failures such as node failure and network failure. In the case of node failure, Elasticsearch relies on replica copies of the data held on other nodes and re-creates the lost copies elsewhere in the cluster. In the case of network failure, Elasticsearch elects a new master node if needed and promotes replicas so that the cluster keeps running. Provided replicas are configured, your data remains safe through either kind of failure.

Now that you know when and why Elasticsearch could be a great choice, let us take a high-level view of the ecosystem: the Elastic Stack.