Elasticsearch Server: Second Edition

Chapter 1. Getting Started with the Elasticsearch Cluster

Welcome to the wonderful world of Elasticsearch, a great full-text search and analytics engine. It doesn't matter whether you are new to Elasticsearch and full-text search in general or already have experience. We hope that by reading this book you'll be able to learn and extend your knowledge of Elasticsearch. As this book is also dedicated to beginners, we decided to start with a short introduction to full-text search in general, followed by a brief overview of Elasticsearch.

The first thing we need to do with Elasticsearch is install it. With many applications, you start with the installation and configuration and usually forget the importance of those steps. We will try to guide you through these steps so that it becomes easier to remember. In addition to this, we will show you the simplest way to index and retrieve data without getting into too many details. By the end of this chapter, you will have learned the following topics:

Full-text searching
Understanding Apache Lucene
Performing text analysis
Learning the basic concepts of Elasticsearch
Installing and configuring Elasticsearch
Using the Elasticsearch REST API to manipulate data
Searching using basic URI requests

Full-text searching

Back in the days when full-text searching was a term known to a small percentage of engineers, most of us used SQL databases to perform search operations. Of course, this is OK, at least to some extent. However, as you go deeper and deeper, you start to see the limits of such an approach, to mention just a few: lack of scalability, not enough flexibility, and lack of language analysis (of course, there were additions that introduced full-text searching to SQL databases). These were the reasons why Apache Lucene (http://lucene.apache.org) was created: to provide a library of full-text search capabilities. It is very fast, scalable, and provides analysis capabilities for different languages.

The Lucene glossary and architecture

Before going into the details of the analysis process, we would like to introduce you to the glossary for Apache Lucene and the overall architecture of Apache Lucene. The basic concepts of the mentioned library are as follows:

Document: This is the main data carrier used during indexing and searching, comprising one or more fields that contain the data we put in and get from Lucene.
Field: This is a section of the document that is built of two parts: the name and the value.
Term: This is a unit of search representing a word from the text.
Token: This is an occurrence of a term in the text of the field. It consists of the term text, start and end offsets, and a type.

Apache Lucene writes all the information to a structure called an inverted index. It is a data structure that maps the terms in the index to the documents that contain them, and not the other way around as a relational database does in its tables. You can think of an inverted index as a data structure where data is term-oriented rather than document-oriented. Let's see how a simple inverted index will look. For example, let's assume that we have documents with only the title field to be indexed and they look as follows:

Elasticsearch Server 1.0 (document 1)
Mastering Elasticsearch (document 2)
Apache Solr 4 Cookbook (document 3)

So, the index (in a very simplified way) can be visualized as follows:

Each term points to the documents it is present in. This allows for very efficient and fast searching, such as term-based queries. In addition to this, each term has a number connected to it, the count, telling Lucene how often the term occurs.
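To make the term-oriented structure more concrete, here is a minimal sketch that builds such an index for the three example titles. It is not how Lucene stores data on disk; plain whitespace splitting stands in for real analysis, and the function name build_inverted_index is our own.

```python
from collections import defaultdict

# Titles from the example above, keyed by document number.
titles = {
    1: "Elasticsearch Server 1.0",
    2: "Mastering Elasticsearch",
    3: "Apache Solr 4 Cookbook",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: whitespace splitting stands in for real analysis.
        for term in text.split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index(titles)
for term in sorted(index):
    # The count shown here is the number of documents containing the term.
    print(f"{term:<15} count={len(index[term])} docs={index[term]}")
```

Looking up a term is then a single dictionary access, which is why term-based queries are cheap in this layout.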

Of course, the actual index created by Lucene is much more complicated and advanced because of additional files that include information such as term vectors, doc values, and so on. However, all you need to know for now is how the data is organized and not what is exactly stored.

Each index is divided into multiple write-once, read-many segments. When indexing, after a single segment is written to the disk, it can't be updated. Therefore, the information on deleted documents is stored in a separate file, and the segment itself is not updated.

However, multiple segments can be merged together through a process called segments merge. After forcing the segments to merge or after Lucene decides that it is time to perform merging, the segments are merged together by Lucene to create larger ones. Merging can be demanding in terms of I/O; however, it is also the moment when information that is not needed anymore is cleaned up (for example, the deleted documents). In addition to this, searching with one large segment is faster than searching with multiple smaller ones holding the same data. That's because, in general, to search means to just match the query terms to the ones that are indexed. You can imagine how searching through multiple small segments and merging those results will be slower than having a single segment preparing the results.
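The idea of immutable segments plus a separate record of deletes can be sketched in a few lines. This is only an illustration under simplifying assumptions: each segment is a plain term-to-document-IDs dictionary, the data is made up, and merge_segments is a hypothetical helper, not a Lucene API.

```python
def merge_segments(segments, deleted_ids):
    """Combine immutable segments into one, dropping deleted documents."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            live = [d for d in doc_ids if d not in deleted_ids]
            if live:
                merged.setdefault(term, []).extend(live)
    return {term: sorted(ids) for term, ids in merged.items()}

# Two small segments, each a term -> document IDs mapping (made-up data).
segment_a = {"elasticsearch": [1, 2], "server": [1]}
segment_b = {"elasticsearch": [4], "solr": [3]}

# Deletes are recorded outside the segments, which are never modified.
deleted = {1}

merged = merge_segments([segment_a, segment_b], deleted)
print(merged)
```

Note that the original segments are left untouched; the deleted document only disappears from the newly written, merged segment.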

Input data analysis

Of course, the question that arises is how the data that is passed in the documents is transformed into the inverted index and how the query text is changed into terms to allow searching. The process of transforming this data is called analysis. You may want some of your fields to be processed by a language analyzer so that words such as car and cars are treated as the same in your index. On the other hand, you may want other fields to be only divided on the white space or only lowercased.

Analysis is done by the analyzer, which is built of a tokenizer and zero or more token filters, and it can also have zero or more character mappers.

A tokenizer in Lucene is used to split the text into tokens, which are basically terms with additional information, such as their position in the original text and their length. The result of the tokenizer's work is called a token stream, where the tokens are put one by one and are ready to be processed by the filters.

Apart from the tokenizer, the Lucene analyzer is built of zero or more token filters that are used to process tokens in the token stream. Some examples of filters are as follows:

Lowercase filter: This makes all the tokens lowercased
Synonyms filter: This is responsible for changing one token to another on the basis of synonym rules
Multiple language stemming filters: These are responsible for reducing tokens (actually, the text part that they provide) into their root or base forms, the stem

Filters are processed one after another, so we have almost unlimited analysis possibilities with the addition of multiple filters one after another.

Finally, the character mappers (called character filters in the Lucene and Elasticsearch APIs) operate on non-analyzed text; they are applied before the tokenizer. Therefore, we can easily remove HTML tags from whole parts of text without worrying about tokenization.
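The whole chain (character mappers, then a tokenizer, then token filters applied in order) can be sketched in plain Python. None of the functions below are Lucene classes; they are simplified stand-ins, and the "stemmer" is a deliberately naive assumption that only strips a trailing s.

```python
import re

def strip_html(text):
    """Character mapper: runs on raw text before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    """Tokenizer: returns (term, start_offset, end_offset) tuples.
    Offsets refer to the mapped text, a simplification of what Lucene does."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def lowercase_filter(tokens):
    """Token filter: lowercases the text of every token."""
    return [(term.lower(), start, end) for term, start, end in tokens]

def plural_stem_filter(tokens):
    """Token filter: naive stand-in for a stemmer that strips a final 's'."""
    return [(term[:-1] if term.endswith("s") and len(term) > 3 else term, start, end)
            for term, start, end in tokens]

def analyze(text, char_mappers, tokenizer, token_filters):
    """Apply character mappers, then the tokenizer, then each filter in order."""
    for mapper in char_mappers:
        text = mapper(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("<b>Fast</b> Cars", [strip_html], whitespace_tokenizer,
                 [lowercase_filter, plural_stem_filter])
print([term for term, _, _ in tokens])
```

Because each filter only sees the output of the previous one, changing the order of the filters can change the resulting terms.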

Indexing and querying

We may wonder how all the preceding functionalities affect indexing and querying when using Lucene and all the software that is built on top of it. During indexing, Lucene will use an analyzer of your choice to process the contents of your document; of course, different analyzers can be used for different fields, so the name field of your document can be analyzed differently compared to the summary field. If we want, a field can also be left unanalyzed.

During a query, your query text will be analyzed as well. However, you can also choose not to analyze your queries. This is crucial to remember because some Elasticsearch queries are analyzed and some are not. For example, the prefix and the term queries are not analyzed, and the match query is analyzed. Being able to choose between queries that are analyzed and ones that are not is very useful; sometimes, you may want to query a field that is not analyzed, while sometimes you may want to have a full-text search analysis. For example, if we search for LightRed and the query is analyzed by the standard analyzer, then the term that would be searched for is lightred, because the standard analyzer lowercases its tokens. If we use a query type that is not analyzed, then we will explicitly search for the LightRed term.

What you should remember about indexing and querying analysis is that the terms produced at index time should match the terms produced at query time. If they don't match, Lucene won't return the desired documents. For example, if you are using stemming and lowercasing during indexing, you need to ensure that the terms in the query are also lowercased and stemmed, or your queries may return no results at all. It is important to keep the token filters in the same order during indexing and query time analysis so that the terms resulting from such analysis are the same.
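A small sketch may help to show why the index-time and query-time terms have to agree. The analyzer here is an assumption (whitespace splitting plus lowercasing), and the two query functions only loosely imitate the behavior of an analyzed query, such as match, and a non-analyzed one, such as term.

```python
def analyze(text):
    """Assumed analyzer: split on whitespace and lowercase each term."""
    return [term.lower() for term in text.split()]

# Index time: the title is analyzed before its terms are stored.
indexed_terms = set(analyze("Elasticsearch Server"))

def analyzed_query(text):
    """Loosely like a match query: the text goes through the same analyzer."""
    return any(term in indexed_terms for term in analyze(text))

def raw_query(term):
    """Loosely like a term query: the text is looked up exactly as typed."""
    return term in indexed_terms

print(analyzed_query("Server"))  # query term is lowercased before lookup
print(raw_query("Server"))       # looked up as typed, capital S included
print(raw_query("server"))       # matches the lowercased term in the index
```

The non-analyzed lookup for Server finds nothing because the index only contains the lowercased term, which is exactly the kind of mismatch described above.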

Scoring and query relevance

There is one additional thing we haven't mentioned till now—scoring. What is the score of a document? The score is a result of a scoring formula that describes how well the document matches the query. By default, Apache Lucene uses the TF/IDF (term frequency / inverse document frequency) scoring mechanism—an algorithm that calculates how relevant the document is in the context of our query. Of course, it is not the only algorithm available, and we will mention other algorithms in the Mappings configuration section of Chapter 2, Indexing Your Data.

Note

If you want to read more about the Apache Lucene TF/IDF scoring formula, please visit Apache Lucene Javadocs for the TFIDFSimilarity class available at http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html.
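To give a feeling for how term frequency and inverse document frequency interact, here is a heavily simplified sketch. The tf and idf functions are modeled on the factors described in the TFIDFSimilarity Javadocs; everything else in the real formula (coordination factor, query normalization, boosts, and field norms) is left out, so treat the numbers as illustrative only.

```python
import math

def tf(freq):
    """Term frequency factor: grows with occurrences in the document."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(query_terms, doc_terms, doc_freqs, num_docs):
    """Sum tf * idf^2 over the query terms present in the document."""
    total = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)
        if freq:
            total += tf(freq) * idf(doc_freqs[term], num_docs) ** 2
    return total

# The three example titles, already lowercased and split into terms.
docs = [
    ["elasticsearch", "server", "1.0"],
    ["mastering", "elasticsearch"],
    ["apache", "solr", "4", "cookbook"],
]

# Count in how many documents each term appears.
doc_freqs = {}
for terms in docs:
    for term in set(terms):
        doc_freqs[term] = doc_freqs.get(term, 0) + 1

scores = [score(["elasticsearch", "server"], terms, doc_freqs, len(docs))
          for terms in docs]
print(scores)
```

The first document matches both query terms and therefore scores highest, the second matches only the more common term, and the third matches nothing.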

Remember, though, that the higher the score value calculated by Elasticsearch and Lucene, the more relevant the document is. The score calculation is affected by parameters such as boost, by different query types (we will discuss these query types in the Basic queries section of Chapter 3, Searching Your Data), or by using different scoring algorithms.

Note

If you want to read more detailed information about how Apache Lucene scoring works, what the default algorithm is, and how the score is calculated, please refer to our book, Mastering ElasticSearch, Packt Publishing.

The basics of Elasticsearch

Elasticsearch is an open source search server project started by Shay Banon and published in February 2010. Since then, the project has grown into a major player in the field of search and data analysis solutions and is widely used in many well-known and lesser-known search applications. In addition, due to its distributed nature and real-time capabilities, many people use it as a document store.

Key concepts of data architecture

Let's go through the basic concepts of Elasticsearch. You can skip this section if you are already familiar with the Elasticsearch architecture. However, if you are not familiar with this architecture, consider reading this section. We will refer to the key words used in the rest of the book.

Index

An index is the logical place where Elasticsearch stores data, so that it can be divided into smaller pieces. If you come from the relational database world, you can think of an index like a table. However, the index structure is prepared for fast and efficient full-text searching, and in particular, does not store original values. If you know MongoDB, you can think of the Elasticsearch index as a collection in MongoDB. If you are familiar with CouchDB, you can think about an index as you would about the CouchDB database. Elasticsearch can hold many indices located on one machine or spread over many servers. Every index is built of one or more shards, and each shard can have many replicas.

Document

The main entity stored in Elasticsearch is a document. Using the analogy to relational databases, a document is a row of data in a database table. When you compare an Elasticsearch document to a MongoDB document, you will see that both can have different structures, but the document in Elasticsearch needs to have the same type for all the common fields. This means that all the documents with a field called title need to have the same data type for it, for example, string.

Documents consist of fields, and each field may occur several times in a single document (such a field is called multivalued). Each field has a type (text, number, date, and so on). The field types can also be complex: a field can contain other subdocuments or arrays. The field type is important for Elasticsearch because it gives information about how various operations such as analysis or sorting should be performed. Fortunately, this can be determined automatically (however, we still suggest using mappings). Unlike in relational databases, documents don't need to have a fixed structure; every document may have a different set of fields, and in addition to this, fields don't have to be known during application development. Of course, one can force a document structure with the use of a schema. From the client's point of view, a document is a JSON object (see more about the JSON format at http://en.wikipedia.org/wiki/JSON). Each document is stored in one index and has its own unique identifier (which can be generated automatically by Elasticsearch) and document type. A document needs to have a unique identifier in relation to the document type. This means that in a single index, two documents can have the same unique identifier if they are not of the same type.
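As an illustration, the following sketch builds a document as it might look from the client's point of view. The index name blog, the type article, the identifier 1, and all the field names are made up for this example.

```python
import json

# A hypothetical blog article; the field names are illustrative only.
article = {
    "title": "Getting started",                    # text
    "tags": ["search", "lucene"],                  # multivalued field
    "published": "2014-02-01",                     # date
    "views": 42,                                   # number
    "author": {"name": "Jane", "city": "Warsaw"},  # subdocument
}

# A document is identified by its index, its type, and its identifier.
doc_address = ("blog", "article", "1")

body = json.dumps(article, sort_keys=True)
print("/".join(doc_address))
print(body)
```

The JSON string in body is what a client would send; the combination of index, type, and identifier is what makes the document addressable.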

Document type

In Elasticsearch, one index can store many objects with different purposes. For example, a blog application can store articles and comments. The document type lets us easily differentiate between the objects in a single index. Every document can have a different structure, but in real-world deployments, dividing documents into types significantly helps in data manipulation. Of course, one needs to keep the limitations in mind; that is, different document types can't set different types for the same property. For example, a field called title must have the same type across all document types in the same index.

Mapping

In the section about the basics of full-text searching (the Full-text searching section), we wrote about the process of analysis, that is, the preparation of input text for indexing and searching. Every field of the document must be properly analyzed depending on its type. For example, a different analysis chain is required for the numeric fields (numbers shouldn't be sorted alphabetically) and for the text fetched from web pages (for example, the first step would require you to omit the HTML tags as they are useless information, or noise). Elasticsearch stores information about the fields in the mapping. Every document type has its own mapping, even if we don't explicitly define it.
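A mapping is itself just a JSON structure. The sketch below shows what one for a hypothetical article type could look like; it follows the mapping syntax of the Elasticsearch 1.x line this book covers (string fields and per-type mappings), so verify the exact keys against the version you run. All field names are assumptions.

```python
import json

# Hypothetical mapping for an "article" document type (1.x-style syntax).
mapping = {
    "mappings": {
        "article": {
            "properties": {
                "title": {"type": "string"},       # analyzed text
                "views": {"type": "integer"},      # sorted numerically
                "published": {"type": "date"},
                # Kept as a single term, not passed through an analyzer.
                "url": {"type": "string", "index": "not_analyzed"},
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

Declaring views as an integer is what lets sorting treat 10 as larger than 9, instead of comparing the values as text.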

Key concepts of Elasticsearch

Now, we already know that Elasticsearch stores data in one or more indices. Every index can contain documents of various types. We also know that each document has many fields and how Elasticsearch treats these fields is defined by mappings. But there is more. From the beginning, Elasticsearch was created as a distributed solution that can handle billions of documents and hundreds of search requests per second. This is due to several important concepts that we are going to describe in more detail now.

Node and cluster

Elasticsearch can work as a standalone, single search server. Nevertheless, to be able to process large sets of data and to achieve fault tolerance and high availability, Elasticsearch can be run on many cooperating servers. Collectively, these servers are called a cluster, and each server forming it is called a node.

Shard

When we have a large number of documents, we may come to a point where a single node is not enough, for example, because of RAM limitations, hard disk capacity, insufficient processing power, or an inability to respond to client requests fast enough. In such a case, data can be divided into smaller parts called shards (where each shard is a separate Apache Lucene index). Each shard can be placed on a different server, and thus, your data can be spread among the cluster nodes. When you query an index that is built from multiple shards, Elasticsearch sends the query to each relevant shard and merges the results in such a way that your application doesn't know about the shards. In addition to this, having multiple shards can speed up the indexing.

Replica

In order to increase query throughput or achieve high availability, shard replicas can be used. A replica is just an exact copy of the shard, and each shard can have zero or more replicas. In other words, Elasticsearch can have many identical shards and one of them is automatically chosen as a place where the operations that change the index are directed. This special shard is called a primary shard, and the others are called replica shards. When the primary shard is lost (for example, a server holding the shard data is unavailable), the cluster will promote the replica to be the new primary shard.

Gateway

Elasticsearch handles many nodes. The cluster state is held by the gateway. By default, every node has this information stored locally, which is synchronized among nodes. We will discuss the gateway module in The gateway and recovery modules section of Chapter 7, Elasticsearch Cluster in Detail.

Indexing and searching

You may wonder how you can practically tie all the indices, shards, and replicas together in a single environment. Theoretically, it should be very difficult to fetch data from the cluster when you have to know where your document is, on which server, and in which shard. Even more difficult is searching, when one query can return documents from different shards placed on different nodes in the whole cluster. In fact, this is a complicated problem; fortunately, we don't have to care about this because it is handled automatically by Elasticsearch itself. Let's look at the following diagram:

When you send a new document to the cluster, you specify a target index and send it to any of the nodes. The node knows how many shards the target index has and is able to determine which shard should be used to store your document. Elasticsearch can alter this behavior; we will talk about this in the Routing section of Chapter 2, Indexing Your Data. The important information that you have to remember for now is that Elasticsearch calculates the shard in which the document should be placed using the unique identifier of the document. After the indexing request is sent to a node, that node forwards the document to the target node, which hosts the relevant shard.
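The shard selection described above boils down to hashing the identifier and taking the result modulo the number of shards. The sketch below shows only this idea; CRC32 is used as a stand-in, as we make no claim about which hash function Elasticsearch uses internally, and pick_shard is a hypothetical helper.

```python
import zlib

def pick_shard(doc_id, number_of_shards):
    """Map a document identifier to a shard number in a repeatable way."""
    # Assumption: CRC32 stands in for the hash Elasticsearch really uses.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards

shards = 5
for doc_id in ["1", "2", "article-42"]:
    print(doc_id, "->", pick_shard(doc_id, shards))
```

Because the same identifier always yields the same shard number, any node can work out where a document lives without asking the others. It also suggests why the result would change if the number of shards changed.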

Now let's look at the following diagram on searching request execution:

When you try to fetch a document by its identifier, the node you send the query to uses the same routing algorithm to determine the shard and the node holding the document and again forwards the query, fetches the result, and sends the result to you. On the other hand, the querying process is a more complicated one. The node receiving the query forwards it to all the nodes holding the shards that belong to a given index and asks for minimum information about the documents that match the query (identifier and score, by default), unless routing is used, where the query will go directly to a single shard only. This is called the scatter phase. After receiving this information, the aggregator node (the node that receives the client request) sorts the results and sends a second request to get the documents that are needed to build the results list (all the other information apart from the document identifier and score).

This is called the gather phase. After this phase is executed, the results are returned to the client.
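The two phases can be sketched with a few in-memory dictionaries standing in for shards. This is only a model of the flow described above: the scores are made up, and the function names scatter and gather are ours, chosen to mirror the names of the phases.

```python
import heapq

# Each shard holds its own documents: id -> (score for the query, source).
shards = [
    {"1": (0.9, {"title": "Elasticsearch Server"}), "4": (0.2, {"title": "Notes"})},
    {"2": (0.7, {"title": "Mastering Elasticsearch"})},
    {"3": (0.4, {"title": "Apache Solr 4 Cookbook"})},
]

def scatter(shards, size):
    """Ask every shard for the identifiers and scores of its best hits."""
    hits = []
    for shard_no, shard in enumerate(shards):
        best = heapq.nlargest(size, shard.items(), key=lambda kv: kv[1][0])
        hits.extend((score, doc_id, shard_no) for doc_id, (score, _) in best)
    return hits

def gather(shards, hits, size):
    """Sort the merged hits and fetch full documents for the top ones only."""
    top = sorted(hits, reverse=True)[:size]
    return [(doc_id, score, shards[shard_no][doc_id][1])
            for score, doc_id, shard_no in top]

results = gather(shards, scatter(shards, size=2), size=2)
for doc_id, score, source in results:
    print(doc_id, score, source["title"])
```

Only identifiers and scores travel in the first step; the full documents are fetched just for the hits that make it into the final list.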

Now the question arises—what is the role of replicas in the process described previously? While indexing, replicas are only used as an additional place to store the data. When executing a query, by default, Elasticsearch will try to balance the load among the shard and its replicas so that they are evenly stressed. Also, remember that we can change this behavior; we will discuss this in the Understanding the querying process section of Chapter 3, Searching Your Data.


