An overview of Elasticsearch
This section gives a high-level overview of Elasticsearch and discusses some related full-text search products.
Learning more about Elasticsearch
Elasticsearch is a free and open source full-text search engine that is built on top of Apache Lucene. Out of the box, Elasticsearch supports horizontal scaling and data redundancy. Released in 2010, Elasticsearch quickly gained recognition in the full-text search space. Its scalability features helped the tool gain market share against similar technologies such as Apache Solr.
Elasticsearch is a persistent document store and retrieval system, and it is similar to a database. However, it is different from relational databases such as MySQL, PostgreSQL, and Oracle in many ways:
- Distributed: Elasticsearch stores data and executes queries across multiple data nodes. This improves scalability, reliability, and performance.
- Fault tolerant: Data is replicated across multiple nodes in an Elasticsearch cluster, so if one node goes down, data is still available.
- Full-text search: Elasticsearch is built on top of Lucene, a full-text search technology, allowing it to understand and search natural language text.
- JSON document store: Elasticsearch stores documents as JSON instead of as rows in a table.
- NoSQL: Elasticsearch uses a JSON-based query language as opposed to Structured Query Language (SQL).
- Non-relational: Unlike relational databases, Elasticsearch doesn't support joins across tables.
- Analytics: Elasticsearch has built-in analytical capabilities, such as word aggregations, geospatial queries, and scripting language support.
- Dynamic Mappings: A mapping in Elasticsearch is analogous to a schema in the relational database world. If the data type for a document field isn't explicitly defined, Elasticsearch will dynamically assign a type to it.
Data distribution, redundancy, and fault tolerance
Figures 1.1 through 1.4 explain how Elasticsearch distributes data across multiple nodes and how it automatically recovers from node failures:
In this figure, we have an Elasticsearch cluster made up of three nodes: elasticsearch-node-01, elasticsearch-node-02, and elasticsearch-node-03. Our data index is broken into three pieces, called shards. These shards are labeled 0, 1, and 2. Each shard is replicated once; this means that there is a redundant copy of every shard. The cluster is colored green because it is in good health: all data shards and replicas are available.
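The layout described above can be modeled with a small sketch. This is only an illustration: the round-robin placement rule and the `allocate` function are assumptions made for this example and are not how Elasticsearch actually decides where shards go. The sketch only demonstrates the constraint that a replica never shares a node with the primary it copies.

```python
NODES = [
    "elasticsearch-node-01",
    "elasticsearch-node-02",
    "elasticsearch-node-03",
]


def allocate(nodes, num_shards, num_replicas=1):
    """Return {node: [(shard_id, role), ...]} using round-robin placement.

    Hypothetical placement rule: copy k of shard s lands on node (s + k),
    wrapping around, so copies of one shard always sit on different nodes.
    """
    if num_replicas >= len(nodes):
        raise ValueError("need more nodes than replicas to keep copies apart")
    layout = {node: [] for node in nodes}
    for shard in range(num_shards):
        # Copy 0 is the primary; copies 1..num_replicas are replicas.
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            layout[node].append((shard, role))
    return layout


layout = allocate(NODES, num_shards=3)
for node, copies in layout.items():
    print(node, copies)
```

With three shards and one replica each, every node ends up holding two shard copies, and in this particular sketch elasticsearch-node-03 holds copies of shards 1 and 2.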
Let's say that the elasticsearch-node-03 host experiences a hardware failure and shuts down. The following figures show what happens to the cluster in this scenario:
Figure 1.2 shows elasticsearch-node-03 experiencing a failure, and the cluster entering a yellow state. This state means that there is at least one copy of each shard active in the cluster, but not all shard replicas are active. In our case, copies of shards 1 and 2 were on the node that failed, elasticsearch-node-03. A yellow state also warns us that if there's another hardware failure, it's possible that not all data shards will be available.
When elasticsearch-node-03 goes down, Elasticsearch will automatically start rebuilding redundant copies of shards 1 and 2 on the remaining nodes; in our case, these are elasticsearch-node-01 and elasticsearch-node-02. This is shown in the following figure:
Once Elasticsearch finishes rebuilding the data replicas, the cluster enters a green state once again. Now, all data and shards are available to query.
The cluster recovery process demonstrated in Figures 1.3 and 1.4 happens automatically in Elasticsearch. No extra configuration or user action is required.
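The failure and recovery sequence can be walked through with a toy simulation. The `cluster_health` and `recover` functions below are hypothetical helpers written for this illustration, and the starting layout is an assumption chosen so that elasticsearch-node-03 holds copies of shards 1 and 2. Elasticsearch also reports a red state when a shard has no active copy at all; the sketch includes it for completeness.

```python
def cluster_health(layout, num_shards, copies_per_shard):
    """Classify a {node: [shard_id, ...]} layout as green, yellow, or red."""
    counts = {shard: 0 for shard in range(num_shards)}
    for shards in layout.values():
        for shard in shards:
            counts[shard] += 1
    if any(n == 0 for n in counts.values()):
        return "red"      # some shard has no copy left
    if any(n < copies_per_shard for n in counts.values()):
        return "yellow"   # all data reachable, but some copies are missing
    return "green"        # every copy of every shard is present


def recover(layout, num_shards, copies_per_shard):
    """Rebuild missing copies on surviving nodes that lack that shard."""
    for shard in range(num_shards):
        holders = [n for n, shards in layout.items() if shard in shards]
        if not holders:
            continue  # nothing left to copy from
        missing = max(0, copies_per_shard - len(holders))
        # Prefer the least-loaded nodes that do not already hold this shard.
        candidates = sorted(
            (n for n in layout if n not in holders),
            key=lambda n: len(layout[n]),
        )
        for node in candidates[:missing]:
            layout[node].append(shard)
    return layout


# Assumed starting layout: one primary and one replica of each shard.
cluster = {
    "elasticsearch-node-01": [0, 2],
    "elasticsearch-node-02": [0, 1],
    "elasticsearch-node-03": [1, 2],
}
states = [cluster_health(cluster, 3, 2)]

del cluster["elasticsearch-node-03"]      # simulate the hardware failure
states.append(cluster_health(cluster, 3, 2))

recover(cluster, 3, 2)                    # rebuild the lost copies
states.append(cluster_health(cluster, 3, 2))

print(states)
```

After recovery, both remaining nodes hold a copy of all three shards in this model, which is why the health returns to green even with only two nodes.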
Full-text search
Full-text search refers to running keyword queries against natural-language text documents. A document can be something like a newspaper article, a blog post, a forum post, or a tweet. In fact, many popular newspapers, forums, and social media websites, such as The New York Times, Stack Overflow, and Foursquare, use Elasticsearch.
Assume that we were to store the following text string in Elasticsearch:
We demand rigidly defined areas of doubt and uncertainty!
A user can find this document by searching Elasticsearch using keywords such as demand or doubt. Elasticsearch also supports word stemming. This means that if we searched for the word define, Elasticsearch would still find this document, provided the field is analyzed with a stemming analyzer, because the root word of defined is define.
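To make keyword matching and stemming more concrete, here is a minimal sketch of an inverted index with a deliberately naive stemmer. Everything in it (the `stem`, `analyze`, `build_index`, and `search` functions, the suffix list, and the rule that a query must match all of its terms) is an assumption made for this illustration; Lucene's real analyzers and stemmers are far more sophisticated.

```python
import re
from collections import defaultdict

# Toy suffix list; real stemmers use much richer rules.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase the text, split it into word tokens, and stem each token."""
    return [stem(token) for token in re.findall(r"[a-z0-9]+", text.lower())]


def build_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents containing every stemmed query term."""
    terms = analyze(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


docs = ["We demand rigidly defined areas of doubt and uncertainty!"]
index = build_index(docs)

print(search(index, "demand"))
print(search(index, "define"))
print(search(index, "galaxy"))
```

Because both define and defined reduce to the same stem in this sketch, a search for define finds the document even though that exact word never appears in it.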
This piece of text, along with some additional metadata, may be stored as follows in Elasticsearch in the JSON format:
{
  "text" : "We demand rigidly defined areas of doubt and uncertainty!",
  "author" : "Douglas Adams",
  "published" : "1979-10-12",
  "likes" : 583,
  "source" : "The Hitchhiker's Guide to the Galaxy",
  "tags" : ["science fiction", "satire"]
}
If we let Elasticsearch dynamically assign a mapping (think schema) to this document, it would look like this:
{
  "quote" : {
    "properties" : {
      "author" : { "type" : "string" },
      "likes" : { "type" : "long" },
      "published" : {
        "type" : "date",
        "format" : "strict_date_optional_time||epoch_millis"
      },
      "source" : { "type" : "string" },
      "tags" : { "type" : "string" },
      "text" : { "type" : "string" }
    }
  }
}
Note that Elasticsearch was able to pick up that the published field looked like a date.
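The idea behind dynamic mapping can be sketched in a few lines. The `infer_type` and `infer_mapping` functions below are hypothetical and greatly simplified: they cover only the value kinds that appear in the example document (plus booleans), they omit the date format string, and they are not Elasticsearch's actual detection rules.

```python
from datetime import date


def infer_type(value):
    """Guess a field type from a single JSON value (simplified assumption)."""
    if isinstance(value, list):
        # An array takes the type of its first element; empty arrays are unknown.
        return infer_type(value[0]) if value else None
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, str):
        try:
            date.fromisoformat(value)  # e.g. "1979-10-12"
            return "date"
        except ValueError:
            return "string"
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def infer_mapping(doc):
    """Build a mapping-like dictionary for every field in the document."""
    return {
        "properties": {
            field: {"type": infer_type(value)}
            for field, value in sorted(doc.items())
        }
    }


doc = {
    "text": "We demand rigidly defined areas of doubt and uncertainty!",
    "author": "Douglas Adams",
    "published": "1979-10-12",
    "likes": 583,
    "source": "The Hitchhiker's Guide to the Galaxy",
    "tags": ["science fiction", "satire"],
}

mapping = infer_mapping(doc)
for field, spec in mapping["properties"].items():
    print(field, "->", spec["type"])
```

The string that parses as an ISO date is the only one classified as a date; every other string falls back to the plain string type.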
An Elasticsearch query that searches for this document looks like this:
{
  "query" : {
    "query_string" : { "query" : "demand rigidly" }
  },
  "size" : 10
}
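As a rough illustration of how such a query might be sent from application code, the sketch below builds an HTTP request for the `_search` endpoint using only the Python standard library. The host and port (`http://localhost:9200`), the index name `quotes`, and the `build_search_request` helper are all assumptions for this example. The request is constructed but not sent, so the code runs without a cluster.

```python
import json
from urllib.request import Request


def build_search_request(base_url, index, keywords, size=10):
    """Build (but do not send) a query_string search request."""
    body = {
        "query": {"query_string": {"query": keywords}},
        "size": size,
    }
    return Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local cluster address and index name.
request = build_search_request("http://localhost:9200", "quotes", "demand rigidly")

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Against a running cluster, passing the request to `urllib.request.urlopen` would submit it, and the response body would be a JSON document containing the matching hits.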
Specifics about Elasticsearch mappings and the Search API are beyond the scope of this book, but you can learn more about them through the official Elasticsearch documentation at the following links:
- Elasticsearch Mappings: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
- Elasticsearch Search API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
Note
Elasticsearch should not be your primary data store. It does not provide the Atomicity, Consistency, Isolation, and Durability (ACID) guarantees of a traditional SQL data store, nor the reliability guarantees of other NoSQL databases such as HBase or Cassandra. Even though Elasticsearch has built-in data redundancy and fault tolerance, it's best practice to archive your data in a separate data store so that you can re-index it into Elasticsearch if needed.
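One way to picture re-indexing from an archive is to convert the archived documents into the newline-delimited JSON body that the Elasticsearch Bulk API accepts, where each document is preceded by an action line. The sketch below is a minimal illustration under stated assumptions: the `to_bulk_body` helper, the index name `quotes`, and the document id are hypothetical, and some Elasticsearch versions expect additional metadata (such as a type) in the action line.

```python
import json


def to_bulk_body(index, archived_docs):
    """Turn (doc_id, document) pairs into a Bulk API request body.

    Each document becomes two lines: an "index" action, then the source.
    The body must end with a newline.
    """
    lines = []
    for doc_id, doc in archived_docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical archive: in practice this would be read from the separate
# data store that holds the authoritative copy of the data.
archive = [
    ("1", {
        "text": "We demand rigidly defined areas of doubt and uncertainty!",
        "author": "Douglas Adams",
    }),
]

bulk_body = to_bulk_body("quotes", archive)
print(bulk_body, end="")
```

Keeping the archive as the source of truth means the index can be dropped and rebuilt at any time by regenerating and resubmitting a body like this one.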
Similar technologies
This section describes a few of the many open source full-text search engines available, and discusses how they compare to Elasticsearch.
Apache Lucene
Apache Lucene (https://lucene.apache.org/core/) is an open source full-text search Java library. As mentioned earlier, Lucene is Elasticsearch's underlying search technology. Lucene also provides Elasticsearch's analytics features such as text aggregations and geospatial search. Using Apache Lucene directly is a good choice if you perform full-text search in Java on a small scale, or are building your own full-text search engine.
The benefits of using Elasticsearch over Lucene are as follows:
- REST API instead of a Java API
- JSON document store
- Horizontal scalability, reliability, and fault tolerance
On the other hand, Lucene is much more lightweight and offers more flexibility for building custom applications that need full-text search integrated from the ground up.
Note
Lucene.NET is a popular .NET port of the library, written in C#.
Solr
Solr is another full-text search engine built on top of Apache Lucene. It has similar search, analytic, and scaling capabilities to Elasticsearch. For most applications that need a full-text search engine, choosing between Solr and Elasticsearch comes down to personal preference.
Ferret
Ferret is a full-text search engine for Ruby. It's similar to Lucene, but it is not as feature-rich. It's generally better suited to Ruby applications that don't require the power (or complexity) of a search engine such as Elasticsearch or Solr.