An overview of Lucene
In order to fully understand how Elasticsearch works, especially when it comes to indexing and query processing, it is crucial to understand how the Apache Lucene library works. Under the hood, Elasticsearch uses Lucene to handle document indexing. The same library is also used to perform a search against the indexed documents. In the next few pages, we will try to show you the basics of Apache Lucene, just in case you've never used it.
Lucene is a mature, open source, high-performance, scalable, lightweight and, yet, very powerful library written in Java. Its core is distributed as a single JAR file with no dependencies, and allows you to index documents and search them with its out-of-the-box full text search capabilities. Of course, there are extensions to Apache Lucene that allow different language handling, and enable spellchecking, highlighting, and much more; but if you don't need those features, you can download a single file and use it in your application.
Getting deeper into the Lucene index
In order to fully understand Lucene, the following terminology needs to be understood first (a short code sketch illustrating these terms follows the list):
- Document: This is the main data carrier used during indexing and searching, containing one or more fields, which hold the data we put into and get back from Lucene.
- Field: This is a section of the document, consisting of two parts: the name and the value.
- Term: This is a unit of search representing a word from the text.
- Token: This is an occurrence of a term in the text of a field. It consists of the term text, its start and end offsets, and a type.
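To make these terms more tangible, the following is a minimal sketch of indexing a single document and searching it with the Lucene library directly. It assumes a Lucene 6.x dependency on lucene-core; exact class names and constructors may differ slightly between versions.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class LuceneBasics {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();            // in-memory index, fine for a demo
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // A document is a set of named fields; TextField values are analyzed into terms
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("title", "Elasticsearch Server", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search for the term "elasticsearch" (the analyzer lowercased it at index time)
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        ScoreDoc[] hits = searcher.search(new TermQuery(new Term("title", "elasticsearch")), 10).scoreDocs;
        for (ScoreDoc hit : hits) {
            System.out.println(searcher.doc(hit.doc).get("title"));
        }
    }
}
```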
Inverted index
Apache Lucene writes all the information to a structure called the inverted index. It is a data structure that maps the terms in the index to the documents, and not the other way round as a relational database does. You can think of an inverted index as a data structure where data is term oriented rather than document oriented.
Let's see how a simple inverted index can look. For example, let's assume that we have documents with only a title field to be indexed, and they look like the following:
- Elasticsearch Server (document 1)
- Mastering Elasticsearch (document 2)
- Elasticsearch Essentials (document 3)
So, the index (in a very simple way) could be visualized as shown in the following table:
| Term | Count | Document : Position |
| --- | --- | --- |
| Elasticsearch | 3 | 1:1, 2:2, 3:1 |
| Essentials | 1 | 3:2 |
| Mastering | 1 | 2:1 |
| Server | 1 | 1:2 |
As you can see, each term points to the number of documents it is present in, along with its positions within those documents. This makes searches such as term-based queries very fast and efficient. In addition to this, each term has a number connected to it: the count, telling Lucene how often the term occurs.
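The table above can be reproduced with a few lines of plain Java. The following toy sketch is only an illustration of the idea, not how Lucene stores its index on disk; it builds a term-oriented map from the three example titles:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToyInvertedIndex {
    public static void main(String[] args) {
        String[] titles = {"Elasticsearch Server", "Mastering Elasticsearch", "Elasticsearch Essentials"};

        // term -> (document id -> positions of the term in that document)
        Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
        for (int docId = 1; docId <= titles.length; docId++) {
            String[] tokens = titles[docId - 1].split("\\s+");
            for (int pos = 1; pos <= tokens.length; pos++) {
                index.computeIfAbsent(tokens[pos - 1], t -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(pos);
            }
        }

        // Prints postings similar to the table above, e.g. Elasticsearch (3) -> {1=[1], 2=[2], 3=[1]}
        index.forEach((term, postings) ->
            System.out.println(term + " (" + postings.size() + ") -> " + postings));
    }
}
```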
Segments
Each index is divided into multiple write-once, read-many-times segments. Once a single segment has been written to disk during indexing, it can't be updated; for example, the information about deleted documents is stored in a separate file, while the segment itself is not updated.
However, multiple segments can be merged together in a process called segment merging. Either after a merge is forced or after Lucene decides it is time for merging to be performed, segments are merged together by Lucene to create larger ones. This can be I/O demanding; however, it is needed to clean up information that is no longer required, such as the deleted documents. In addition to this, searching with the use of one larger segment is faster than searching against multiple smaller ones holding the same data.
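Elasticsearch exposes this operation through its force merge API; at the Lucene level, an explicit merge can be requested roughly as in the following sketch (the index path is a placeholder, and the same hypothetical Lucene 6.x dependency as before is assumed):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ForceMergeSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder path: point it at an existing Lucene index directory
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // Merge the index down to (at most) one segment; this is I/O heavy
            // and also drops data belonging to deleted documents.
            writer.forceMerge(1);
        }
        dir.close();
    }
}
```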
Of course, the actual index created by Lucene is much more complicated and advanced, and consists of more than the terms, their counts, and the documents in which they are present. We would like to tell you about a few of these additional index pieces because, even though they are internal, it is usually good to know about them, as they can be very useful.
Norms
A norm is a normalization factor associated with each indexed document and is used to compute the score of the document relative to the query. Norms are computed on the basis of index-time boosts and are indexed along with the documents. With the use of norms, Lucene is able to provide index-time boosting functionality at the cost of a certain amount of additional space needed to index the norms and some amount of additional memory.
Term vectors
Term vectors are small inverted indices per document. They consist of pairs of a term and its frequency, and can optionally include information about term positions. By default, Lucene and Elasticsearch don't enable term vector indexing, but some functionalities, such as fast vector highlighting, require them to be present.
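In Lucene, term vectors are enabled per field through its FieldType. The following is a small sketch, again assuming a Lucene 6.x dependency, of a text field configured to store term vectors with positions and offsets; in Elasticsearch, the equivalent is the term_vector setting in the field mapping:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

public class TermVectorField {
    public static void main(String[] args) {
        // Start from the standard analyzed-and-stored text field type and enable term vectors
        FieldType withTermVectors = new FieldType(TextField.TYPE_STORED);
        withTermVectors.setStoreTermVectors(true);
        withTermVectors.setStoreTermVectorPositions(true);
        withTermVectors.setStoreTermVectorOffsets(true);
        withTermVectors.freeze();

        Document doc = new Document();
        doc.add(new Field("title", "Mastering Elasticsearch", withTermVectors));
        // The document can now be added with an IndexWriter as in the earlier sketch
    }
}
```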
Posting formats
With the release of Lucene 4.0, the library introduced the so-called codec architecture, giving developers control over how the index files are written onto the disk. One of the parts of the index is the posting format, which stores fields, terms, documents, term positions and offsets, and, finally, the payloads (a byte array stored at an arbitrary position in the Lucene index, which can contain any information we want). Lucene contains different posting formats for different purposes; for example, one that is optimized for high-cardinality fields such as a unique identifier.
Doc values
As we have already mentioned, the Lucene index is a so-called inverted index. However, for certain features, such as aggregations, such an architecture is not the best one. These functionalities operate on the document level rather than the term level, so Elasticsearch would need to uninvert the index before calculations could be done. Because of that, doc values were introduced: an additional structure used for sorting and aggregations. Doc values store uninverted data for the fields they are turned on for. Both Lucene and Elasticsearch allow us to configure the implementation used to store them, giving us the possibility of memory-based doc values, disk-based doc values, and a combination of the two. Doc values are enabled by default in Elasticsearch since the 2.x release.
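At the Lucene level, doc values are simply another per-field structure added at index time. The following sketch (hypothetical field name, Lucene 6.x style APIs) indexes a numeric price field as doc values so that it can be sorted on without uninverting the index:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class DocValuesSketch {
    public static void main(String[] args) {
        // Index time: store the field value in a column-oriented (uninverted) structure
        Document doc = new Document();
        doc.add(new NumericDocValuesField("price", 1500)); // used for sorting and aggregations
        doc.add(new StoredField("price", 1500));           // used to return the value in results

        // Query time: sorting reads the doc values instead of uninverting the index,
        // for example with searcher.search(query, 10, byPrice) on an IndexSearcher
        Sort byPrice = new Sort(new SortField("price", SortField.Type.LONG));
        System.out.println(byPrice);
    }
}
```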
Document analysis
When we index a document into Elasticsearch, it goes through an analysis phase which is necessary in order to create the inverted indices. It is a series of steps performed by Lucene, which are depicted in the following image:
Analysis is done by the analyzer, which is built of a tokenizer and zero or more filters, and can also have zero or more character filters.
A tokenizer in Lucene is used to divide the text into tokens, which are basically terms with additional information, such as their position in the original text and their length. The result of the tokenizer's work is a so-called token stream, where the tokens are put one by one and are ready to be processed by the filters.
Apart from the tokenizer, the Lucene analyzer is built of zero or more filters that are used to process tokens in the token stream. For example, it can remove tokens from the stream, change them, or even produce new ones. There are numerous filters and you can easily create new ones. Some examples of filters are as follows:
- Lowercase filter: This makes all the tokens lowercase
- ASCII folding filter: This removes non-ASCII parts from tokens
- Synonyms filter: This is responsible for changing one token to another on the basis of synonym rules
- Multiple language stemming filters: These are responsible for reducing tokens (actually the text part that they provide) into their root or base forms, the stem
Filters are processed one after another, so by chaining multiple filters we have almost unlimited analysis possibilities.
The last thing is character filtering, which is applied before the tokenizer and is responsible for processing the text before any analysis is done. One example of a character filter is the removal of HTML tags.
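The following sketch, assuming a Lucene 6.x dependency, runs a piece of text through an analyzer and prints the resulting token stream, which is exactly what happens to field values during indexing:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class AnalysisDemo {
    public static void main(String[] args) throws Exception {
        // Standard tokenizer followed by lowercasing (and, depending on the version, stop word removal)
        Analyzer analyzer = new StandardAnalyzer();
        try (TokenStream stream = analyzer.tokenStream("title", "Mastering Elasticsearch, Second Edition")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // Prints each produced token with its offsets, e.g. mastering [0-9]
                System.out.println(term.toString() + " [" + offset.startOffset() + "-" + offset.endOffset() + "]");
            }
            stream.end();
        }
        analyzer.close();
    }
}
```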
This analysis phase is also applied at query time. However, you can also choose the other path and not analyze your queries. This is crucial to remember, because some of the Elasticsearch queries are analyzed and some are not. For example, the prefix query is not analyzed, while the match query is analyzed.
What you should remember about index-time and query-time analysis is that the indexed terms must match the query terms. If they don't match, Lucene won't return the desired documents. For example, if you are using stemming and lowercasing during indexing, you need to be sure that the terms in the query are also lowercased and stemmed, or your queries will return no results at all.
Basics of the Lucene query language
Some of the query types provided by Elasticsearch support Apache Lucene query parser syntax. Because of this, it is crucial to understand the Lucene query language.
A query is divided by Apache Lucene into terms and operators. A term, in Lucene, can be a single word or a phrase (a group of words surrounded by double quote characters). If the query is set to be analyzed, the defined analyzer will be used on each of the terms that form the query.
A query can also contain Boolean operators that connect terms to each other forming clauses. The list of Boolean operators is as follows:
- AND: This means that the given two terms (left and right operand) need to match in order for the clause to be matched. For example, we would run a query such as apache AND lucene to match documents with both the apache and lucene terms in a document field.
- OR: This means that any of the given terms may match in order for the clause to be matched. For example, we would run a query such as apache OR lucene to match documents with the apache or lucene (or both) terms in a document field.
- NOT: This means that in order for the document to be considered a match, the term appearing after the NOT operator must not match. For example, we would run the query lucene NOT Elasticsearch to match documents that contain the lucene term, but not the Elasticsearch term, in the document field.
In addition to these, we may use the following operators:
- +: This means that the given term needs to be matched in order for the document to be considered a match. For example, in order to find documents that match the lucene term and may match the apache term, we would run a query such as +lucene apache.
- -: This means that the given term can't be matched in order for the document to be considered a match. For example, in order to find documents with the lucene term, but not the Elasticsearch term, we would run a query such as +lucene -Elasticsearch.

When not specifying any of the previous operators, the default OR operator will be used.
In addition to all these, there is one more thing: you can use parentheses to group clauses together; for example, with something like the following query:
Elasticsearch AND (mastering OR book)
Querying fields
Of course, just like in Elasticsearch, in Lucene all your data is stored in fields that build the document. In order to run a query against a field, you need to provide the field name, add the colon character, and provide the clause that should be run against that field. For example, if you would like to match documents with the term Elasticsearch in the title field, you would run the following query:
title:Elasticsearch
You can also group multiple clauses. For example, if you would like your query to match all the documents having the Elasticsearch term and the mastering book phrase in the title field, you could run a query like the following code:
title:(+Elasticsearch +"mastering book")
The previous query can also be expressed in the following way:
+title:Elasticsearch +title:"mastering book"
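If you want to experiment with this syntax outside Elasticsearch, Lucene ships a classic query parser (in the lucene-queryparser module) that turns such strings into query objects. A small sketch, assuming a Lucene 6.x dependency:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class QueryParsing {
    public static void main(String[] args) throws Exception {
        // "title" is the default field used when a clause does not name a field explicitly
        QueryParser parser = new QueryParser("title", new StandardAnalyzer());

        Query q1 = parser.parse("Elasticsearch AND (mastering OR book)");
        Query q2 = parser.parse("+title:Elasticsearch +title:\"mastering book\"");

        // Printing the parsed queries shows how Lucene interpreted the syntax
        System.out.println(q1);
        System.out.println(q2);
    }
}
```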
Term modifiers
In addition to the standard field query with a simple term or clause, Lucene allows us to modify the terms we pass in the query with modifiers. The most common modifiers, which you will be familiar with, are wildcards. There are two wildcards supported by Lucene: ? and *. The first one matches any single character and the second one matches multiple characters (including none).
In addition to this, Lucene supports fuzzy and proximity searches with the use of the ~ character and an integer following it. When used with a single word term, it means that we want to search for terms that are similar to the one we've modified (the so-called fuzzy search). The integer after the ~ character specifies the maximum number of edits that can be done to consider the term similar. For example, if we ran a query such as writer~2, both the terms writer and writers would be considered a match.
When the ~ character is used on a phrase, the integer number we provide tells Lucene how much distance between the words is acceptable. For example, let's take the following query:
title:"mastering Elasticsearch"
It would match the document with the title field containing mastering Elasticsearch, but not mastering book Elasticsearch. However, if we ran a query such as title:"mastering Elasticsearch"~2, it would result in both example documents being matched.
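Under the hood, these modifiers map to ordinary Lucene query objects. The following sketch (Lucene 6.x style APIs, hypothetical title field) builds the programmatic equivalents of writer~2 and title:"mastering Elasticsearch"~2:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PhraseQuery;

public class ModifierQueries {
    public static void main(String[] args) {
        // Equivalent of writer~2: match terms within an edit distance of 2 from "writer"
        FuzzyQuery fuzzy = new FuzzyQuery(new Term("title", "writer"), 2);

        // Equivalent of title:"mastering Elasticsearch"~2: a phrase query with a slop of 2
        PhraseQuery proximity = new PhraseQuery.Builder()
                .add(new Term("title", "mastering"))
                .add(new Term("title", "elasticsearch"))
                .setSlop(2)
                .build();

        System.out.println(fuzzy);
        System.out.println(proximity);
    }
}
```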
We can also use boosting to increase the importance of a term by using the ^ character and providing a float number. Boosts lower than 1 decrease the document importance, while boosts higher than 1 increase it. The default boost value is 1. Please refer to The changed default text scoring in Lucene - BM25 section in Chapter 2, The Improved Query DSL, for further information on what boosting is and how it is taken into consideration during document scoring.
In addition to all these, we can use square and curly brackets to allow range searching. For example, if we would like to run a range search on a numeric field, we could run the following query:
price:[10.00 TO 15.00]
The preceding query would result in all documents with the price field between 10.00 and 15.00, inclusive.
In the case of string-based fields, we can also run a range query; for example, name:[Adam TO Adria]. The preceding query would result in all documents containing terms between Adam and Adria in the name field, including the bounds.
If you would like your range bound or bounds to be exclusive, use curly brackets instead of the square ones. For example, in order to find documents with the price field between 10.00 inclusive and 15.00 exclusive, we would run the following query:
price:[10.00 TO 15.00}
If you would like your range to be bounded on one side only, for example to query for documents with a price of 10.00 or higher, we would run the following query:
price:[10.00 TO *]
Handling special characters
In case you want to search for one of the special characters (which are +, -, &&, ||, !, (, ), { }, [ ], ^, ", ~, *, ?, :, \, /), you need to escape it with the use of the backslash (\) character. For example, to search for the abc"efg term you need to write it as abc\"efg.
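Rather than escaping by hand, the classic query parser also provides a helper that escapes every special character for you; a small sketch, assuming the lucene-queryparser module is on the classpath:

```java
import org.apache.lucene.queryparser.classic.QueryParser;

public class EscapingDemo {
    public static void main(String[] args) {
        // QueryParser.escape backslash-escapes every character the query syntax treats as special
        System.out.println(QueryParser.escape("abc\"efg"));   // prints: abc\"efg
        System.out.println(QueryParser.escape("(1+1):2"));    // prints: \(1\+1\)\:2
    }
}
```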
An overview of Elasticsearch
Although we've said that we expect the reader to be familiar with Elasticsearch, we would really like to give you a short introduction to the concepts of this great search engine.
As you probably know, Elasticsearch is a distributed full text search and analytics engine built on top of Lucene for building search and analysis oriented applications. It was originally created by Shay Banon and published in February 2010. Since then, it has rapidly gained popularity and has become an important alternative to other open source and commercial solutions. It is one of the most downloaded open source projects.
The key concepts
There are a few concepts that come with Elasticsearch, and their understanding is crucial to fully understand how Elasticsearch works and operates:
- Index: A logical namespace under which Elasticsearch stores data and may be built with more than one Lucene index using shards and replicas.
- Document: A document is a JSON object that contains the actual data in key-value pairs. It is very important to understand that when a field is indexed into the index for the first time, Elasticsearch creates a data type for that field. Starting from version 2.x, very strict type checking is performed.
- Type: A doc type in Elasticsearch represents a class of similar documents. A type consists of a name such as a user or a blog post, and a mapping including data types and the Lucene configurations for each field.
- Mapping: As already mentioned in the An overview of Lucene section, all documents are analyzed before being indexed. We can configure how the input text is divided into tokens, which tokens should be filtered out, or what additional processing, such as removing HTML tags, is needed. This is where mapping comes into play: it holds all the information about the analysis chain. Besides the fact that Elasticsearch can automatically discover a field type by looking at its value, in most cases we will want to configure the mappings ourselves to avoid unpleasant surprises.
- Node: A single instance of Elasticsearch running on a machine. Elasticsearch nodes can serve different purposes. Of course, Elasticsearch is designed to index and search our data, so the first type of node is the data node. Such nodes hold the data and perform searches on it. The second type of node is the master node, which works as a supervisor of the cluster, controlling the work of the other nodes. The third node type is the client node, which is used as a query router. The fourth type of node is the tribe node, which was introduced in Elasticsearch 1.0. The tribe node can join multiple clusters and thus act as a bridge between them, allowing us to execute almost all Elasticsearch functionalities on multiple clusters just as we would on a single cluster. Elasticsearch 5.0 has also introduced a new type of node called the ingest node, which can be used for data transformation before the data gets indexed.
- Cluster: A cluster is a single name under which one or more nodes/instances of Elasticsearch are connected to each other.
- Shard: Shards are containers for the data of an index and are composed of Lucene segments. An index is divided into one or more shards to make the data distributable, and those shards can be stored on a single node or spread across multiple nodes. Once an index has been created, the number of its shards cannot be increased or decreased.
Note
A shard can be either primary or secondary. A primary shard is the one to which all the operations that change the index are directed. A secondary shard contains a duplicate of the primary shard's data and helps in searching data quickly as well as in high availability; if the machine that holds the primary shard goes down, the secondary shard automatically becomes the primary shard.
- Replica: A duplicate copy of the data living in a shard for high availability. Having a replica also provides a faster search experience.
Working of Elasticsearch
Elasticsearch uses the zen discovery module for cluster formation. In 1.x, multicast was the default discovery type used in Elasticsearch, but in 2.x unicast became the default, although multicast was still available as a plugin. Multicast support has been completely removed from Elasticsearch 5.0.
When an Elasticsearch node starts, it performs discovery and searches for the list of unicast hosts (master-eligible nodes), which are configured in the elasticsearch.yml configuration file using the discovery.zen.ping.unicast.hosts parameter. The default list of unicast hosts is ["127.0.0.1", "[::1]"], so a node that is just starting can discover other nodes running on the same machine instead of forming a cluster only with itself. We will have a detailed section on zen discovery and node configurations in Chapter 8, Elasticsearch Administration.
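A unicast discovery setup in elasticsearch.yml might look like the following sketch; the cluster name, node name, and host addresses are placeholders, and discovery.zen.minimum_master_nodes is shown only as the usual companion setting for avoiding split-brain situations:

```yaml
# elasticsearch.yml (sketch; names and addresses below are placeholders)
cluster.name: my-cluster
node.name: node-1

# List of master-eligible nodes this node will ping during discovery
discovery.zen.ping.unicast.hosts: ["192.168.1.10", "192.168.1.11", "192.168.1.12"]

# Minimum number of master-eligible nodes required to form a cluster
discovery.zen.minimum_master_nodes: 2
```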