Introducing Elasticsearch 5.x
In 2015, after acquiring Kibana, Logstash, Beats, and Found, the company behind Elasticsearch re-branded itself as Elastic. According to Shay Banon, the name change is part of an initiative to better align the company with the broad set of solutions it provides: future products and new innovations created by Elastic's massive community of developers and the enterprises that use the ELK stack for everything from real-time search, to sophisticated analytics, to building modern data applications.
But having several products under one roof caused friction during the release process and started creating confusion for users. As a result, the ELK stack was renamed the Elastic Stack, and the company decided to release all components of the Elastic Stack together, so that they all share the same version number. This keeps pace with your deployments, simplifies compatibility testing, and makes it even easier for developers to add new functionality across the stack.
The very first GA release under the Elastic Stack is 5.0.0, which will be covered throughout this book. Further, Elasticsearch keeps pace with Lucene releases to incorporate bug fixes and the latest features. Elasticsearch 5.0 is based on Lucene 6, a major Lucene release with some awesome new features and a focus on improving search speed. We will discuss Lucene 6 in upcoming chapters to show how Elasticsearch benefits, from both the search and storage points of view.
Introducing new features in Elasticsearch
Elasticsearch 5.x has many improvements and has gone through significant refactoring, which led to the removal or deprecation of some features. We will keep discussing the removed, improved, and new features in upcoming chapters, but for now let's take an overview of what is new and improved in Elasticsearch.
New features in Elasticsearch 5.x
The following are some of the most important features introduced in Elasticsearch version 5.0:
- Ingest node: This is a new type of node in Elasticsearch, which can be used for simple data transformation and enrichment before the actual indexing takes place. The best thing is that any node can be configured to act as an ingest node, and it is very lightweight. You can avoid Logstash for these tasks because the ingest node is a Java-based implementation of the Logstash filters and comes by default in Elasticsearch itself.
- Index shrinking: By design, once an index is created there is no provision for reducing the number of shards of that index, and this brings a lot of challenges since each shard consumes some resources. Although this design still remains the same, to make life easier for users Elasticsearch has introduced a new _shrink API to overcome this problem. This API allows you to shrink an existing index into a new index with a smaller number of shards.

Note

We will cover the ingest node and the _shrink API in detail in Chapter 9, Data Transformation and Federated Search.

- Painless scripting language: In Elasticsearch, scripting has always been a matter of concern because of its slowness and for security reasons. Elasticsearch 5.0 includes a new scripting language called Painless, which has been designed to be both fast and secure. Painless is still going through lots of improvements to make it even better and more easily adoptable. We will cover it in Chapter 3, Beyond Full Text Search.
- Instant aggregations: Queries have been completely refactored in 5.0; they are now parsed on the coordinating node and serialized to the other nodes in a binary format. This allows Elasticsearch to be much more efficient, with more cacheable queries, especially on data separated into time-based indices, and will significantly speed up aggregations.
- A new completion suggester: The completion suggester has undergone a complete rewrite. This means that the syntax and data structure for fields of type completion have changed, as have the syntax and response of completion suggester requests. The completion suggester is now built on top of the first iteration of Lucene's new suggest API.
- Multi-dimensional points: This is one of the most exciting features of Lucene 6, which empowers Elasticsearch 5.0. It is built using the k-d tree data structure to offer fast single- and multi-dimensional numeric range queries and geospatial point-in-shape filtering. Multi-dimensional points help in reducing disk storage and memory utilization, and make searches faster.
- Delete by Query API: After much demand from the community, Elasticsearch has finally provided the ability to delete documents based on a matching query, using the _delete_by_query REST endpoint.
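As a rough sketch of how some of these features are used (the pipeline, index, and field names below are invented for illustration; the requests are shown in the console style used elsewhere in this chapter), an ingest pipeline is defined once and then referenced at index time, an index is shrunk via the _shrink API, and matching documents are removed with _delete_by_query:

```
PUT _ingest/pipeline/rename_hostname
{
  "description": "rename the hostname field to host",
  "processors": [
    {
      "rename": {
        "field": "hostname",
        "target_field": "host"
      }
    }
  ]
}

PUT logs/log/1?pipeline=rename_hostname
{
  "hostname": "server-01"
}

POST logs/_shrink/logs_small
{
  "settings": {
    "index.number_of_shards": 1
  }
}

POST logs/_delete_by_query
{
  "query": {
    "match": {
      "host": "server-01"
    }
  }
}
```

Note that before calling _shrink, the source index must be made read-only and all of its shards must be relocated to a single node; these prerequisites are discussed along with the API itself.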
New features in Elasticsearch 2.x
Apart from the features discussed just now, you can also benefit from all of the new features that came in Elasticsearch version 2.x. For those who have not had a look at the 2.x series, let's have a quick recap of the new features which came with Elasticsearch in this series:
- Reindex API: In Elasticsearch, re-indexing of documents is needed by almost every user, under several scenarios. The _reindex API makes this task very easy, and you do not need to worry about writing your own code to do the same. At its simplest, this API provides the ability to move data from one index to another, but it also provides great control while re-indexing the documents, such as using scripts for data transformation, among many other parameters. You can take a look at the reindex API at the following URL: https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-reindex.html.
- Update by query: Similar to the re-indexing requirement, users also demand the ability to easily update documents in place, based on certain conditions, without re-indexing the data. Elasticsearch provided this feature through the _update_by_query REST endpoint in version 2.x.
- Tasks API: The task management API, which is exposed by the _tasks REST endpoint, is used for retrieving information about the tasks currently executing on one or more nodes in the cluster. The following examples show the usage of the tasks API:
GET /_tasks
GET /_tasks?nodes=nodeId1,nodeId2
GET /_tasks?nodes=nodeId1&actions=cluster:*
- Since each task has an ID, you can either wait for the completion of the task or cancel the task in the following way:
POST /_tasks/taskId1/_cancel
- Query profiler: The Profile API is an awesome tool for debugging queries and getting the insights needed to know why a certain query is slow, so that you can take steps to improve it. This API was released in version 2.2.0 and provides detailed timing information about the execution of individual components in a search request. You just need to send profile as true with your query object to get this working for you. For example:
curl -XGET 'localhost:9200/_search' -d '{
  "profile": true,
  "query" : {
    "match" : { "message" : "query profiling test" }
  }
}'
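The _reindex and _update_by_query endpoints mentioned above follow the same request style. A minimal sketch looks like the following (the index names, field, and script are illustrative, not from the original text; in 2.x the default script language was Groovy, while 5.x defaults to Painless):

```
POST _reindex
{
  "source": { "index": "old_index" },
  "dest":   { "index": "new_index" }
}

POST my_index/_update_by_query
{
  "query":  { "match": { "status": "draft" } },
  "script": { "inline": "ctx._source.status = 'published'" }
}
```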
The changes in Elasticsearch
The change list is very long and covering all the details is out of the scope of this book, since most of the changes are internal-level changes which a user need not worry about. However, we will cover the most important changes an existing Elasticsearch user must know.
Although this book is based on Elasticsearch version 5.0, it is very important for the reader to know about the changes made between versions 1.x and 2.x. If you are new to Elasticsearch and are not aware of the older versions, you can skip this section.
Changes between 1.x to 2.x
Elasticsearch version 2.x was focused on resiliency, reliability, simplification, and features. This release was based on Apache Lucene 5.x and specifically improves query execution and spatial search.
Version 2.x also delivers considerable improvements in index recovery. Historically, Elasticsearch index recovery was extremely painful, whether as part of node maintenance or an upgrade. The bigger the cluster, the bigger the headache. Node failures or a reboot could trigger a shard reallocation storm, and entire shards were sometimes copied over the network despite the data already being present on the node. Users have also reported more than a day of recovery time just to restart a single node.
With 2.x, recovery of existing replica shards became almost instant, and reallocation is more lenient, which avoids reshuffling and makes rolling upgrades much easier and faster. Auto-regulating feedback loops in recent updates also eliminate past worries about merge throttling and related settings.
Elasticsearch 2.x also solved many of the known issues that plagued previous versions, including:
- Mapping conflicts (often yielding wrong results)
- Memory pressures and frequent garbage collections
- Low reliability of data
- Security breaches and split brains
- Slow recovery during node maintenance or rolling cluster upgrades
Mapping changes
Elasticsearch developers earlier treated an index as a database and a type as a table. This allowed users to create multiple types inside the same index, but it eventually became a major source of issues because of restrictions imposed by Lucene.
Fields that have the same name inside multiple types of a single index are mapped to a single field inside Lucene. Incorrect query results and even index corruption can occur when, for example, a field is of integer type in one document type and of string type in another. These and several other issues led to a refactoring of mappings and major restrictions on handling mapping conflicts.
The following are the most significant changes imposed by Elasticsearch version 2.x:
- Field names must be referenced by full name.
- Field names cannot be referenced using a type name prefix.
- Field names can't contain dots.
- Type names can't start with a dot (.percolator is an exception).
- Type names may not be longer than 255 characters.
- Types may no longer be deleted. So, if an index contains multiple types, you cannot delete any of the types from the index. The only solution is to create a new index and reindex the data.
- The index_analyzer and _analyzer parameters were removed from mapping definitions.
- Doc values became the default.
- A parent type can't pre-exist and must be included when creating a child type.
- The ignore_conflicts option of the put mappings API was removed and conflicts cannot be ignored anymore.
- Documents and mappings can't contain metadata fields that start with an underscore. So, if you have an existing document that contains a field named _id or _type, it will not work in version 2.x. You need to reindex your documents after dropping those fields.
- The default date format has changed from date_optional_time to strict_date_optional_time, which expects a four-digit year and a two-digit month and day (and, optionally, a two-digit hour, minute, and second). So a dynamically mapped date such as "2016-01-01" will be stored inside Elasticsearch in the "strict_date_optional_time||epoch_millis" format. Please note that if you have been using Elasticsearch 1.x, your date range queries might be impacted by this. For example, if in Elasticsearch 1.x you have two documents indexed, one with the date 2017-02-28T12:00:00.000Z and the second with the date 2017-03-01T11:59:59.000Z, and you are searching for documents between February 28, 2017 and March 1, 2017, the following query would return both documents:
{
  "range": {
    "created_at": {
      "gte": "2017-02-28",
      "lte": "2017-03-01"
    }
  }
}
But from version 2.0 onwards, the same query must use the complete date and time to get the same results. For example:
{
  "range": {
    "created_at": {
      "gte": "2017-02-28T00:00:00.000Z",
      "lte": "2017-03-01T11:59:59.000Z"
    }
  }
}
In addition, you can also use date math in combination with date rounding to get the same results, as in the following query:
{
  "range": {
    "doc.created_at": {
      "lte": "2017-02-28||+1d/d",
      "gte": "2017-02-28",
      "format": "strict_date_optional_time"
    }
  }
}
Query and filter changes
Prior to version 2.0.0, Elasticsearch had two different objects for querying data: queries and filters. Each was different in functionality and performance.
Queries were used to find out how relevant a document was to a particular search by calculating a score for each document. Filters were used to match certain criteria and were cacheable to enable faster execution. This means that if a filter matched 1,000 documents, Elasticsearch would cache them as a bitset in memory to retrieve them quickly in case the same filter was executed again.
However, with the release of Lucene 5.0, which is used by Elasticsearch version 2.0.0, both queries and filters became the same internal object, taking care of both document relevance and matching.
So, an Elasticsearch query that used to look like the following:

{
  "filtered" : {
    "query": { query definition },
    "filter": { filter definition }
  }
}

should now be written like this in version 2.x:

{
  "bool" : {
    "must": { query definition },
    "filter": { filter definition }
  }
}
Additionally, the confusion caused by choosing between a bool filter and an and/or filter has been addressed by eliminating the and/or filters in favor of the bool query syntax shown in the preceding example. Rather than incurring the unnecessary caching and memory cost that often resulted from a badly chosen filter, Elasticsearch now tracks and optimizes frequently used filters, and doesn't cache filters on segments with fewer than 10,000 documents or less than 3% of the index.
Security, reliability, and networking changes
Starting from 2.x, Elasticsearch runs with the Java Security Manager enabled by default, which greatly limits the permissions available to the process after startup.
Elasticsearch has applied a durable-by-default approach to reliability and data duplication across multiple nodes. Documents are now synced to disk before indexing requests are acknowledged, and all file renames are now atomic to prevent partially written files.
On the networking side, based on extensive feedback from system administrators, Elasticsearch removed multicast, and the default zen discovery mechanism has been changed to unicast. Elasticsearch also now binds to localhost by default, preventing unconfigured nodes from joining public networks.
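With multicast gone, discovery for a real cluster is configured explicitly in elasticsearch.yml. A minimal sketch might look like the following (the addresses are placeholders, not values from the original text):

```
network.host: 192.168.1.10
discovery.zen.ping.unicast.hosts: ["192.168.1.10", "192.168.1.11"]
```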
Monitoring parameter changes
Before version 2.0.0, Elasticsearch used the SIGAR library for operating system-dependent statistics. But SIGAR is no longer maintained, and it has been replaced in Elasticsearch by the stats provided by the JVM. Accordingly, we see various changes in the monitoring parameters of the node info and node stats APIs:

- network.* has been removed from nodes info and nodes stats.
- fs.*.dev and fs.*.disk* have been removed from nodes stats.
- os.* has been removed from nodes stats, except for os.timestamp, os.load_average, os.mem.*, and os.swap.*.
- os.mem.total and os.swap.total have been removed from nodes info.
- The id_cache parameter of the _stats API, which reported the memory usage of the parent-child data structure, has also been removed. This information can now be fetched from fielddata.
Changes between 2.x to 5.x
Elasticsearch 2.x did not see as many releases as the 1.x series. The last release under 2.x was 2.3.4, and since then Elasticsearch 5.0 has been released. The following are the most important changes an existing Elasticsearch user must know before adopting the latest releases.
Note
Elasticsearch 5.x requires Java 8, so make sure to upgrade your Java version before getting started with Elasticsearch.
Mapping changes
From a user's perspective, the mapping changes are the most important ones to know about, because a wrong mapping will prevent index creation or can lead to unwanted search results. Here are the most important changes in this category that you need to know.
No more string fields
The string type has been removed in favor of the text and keyword data types. In earlier versions of Elasticsearch, the default mapping for string-based fields looked like the following:
{
  "content" : {
    "type" : "string"
  }
}
Starting from version 5.0, the same will be created using the following syntax:
{
  "content" : {
    "type" : "text",
    "fields" : {
      "keyword" : {
        "type" : "keyword",
        "ignore_above" : 256
      }
    }
  }
}
This allows you to perform a full-text search on the original field and to sort and run aggregations on its keyword sub-field.
Note
Multi-fields are enabled by default for string-based fields and can cause extra overhead if a user is relying on dynamic mapping generation.
However, if you want to create a specific mapping for a string field used only for full-text search, it should be created as shown in the following example:

{
  "content" : {
    "type" : "text"
  }
}
Similarly, a not_analyzed string field now needs to be created using the following mapping:

{
  "content" : {
    "type" : "keyword"
  }
}
Note
On all field data types (except for the deprecated string field), the index property now only accepts true/false instead of not_analyzed/no.
Floats are default
Earlier, the default data type for decimal fields used to be double, but it has now been changed to float.
Changes in numeric fields
Numeric fields are now indexed with a completely different data structure, called the BKD tree, which is expected to require less disk space and to be faster for range queries.
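For context, the queries that benefit from this structure are ordinary numeric range queries, such as the following sketch (the index and field names are illustrative):

```
GET my_index/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 10,
        "lte": 100
      }
    }
  }
}
```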
Changes in geo_point fields
Similar to numeric fields, the geo_point field now also uses the new BKD tree structure, and the following field parameters for geo_point fields are no longer supported: geohash, geohash_prefix, geohash_precision, and lat_lon. Geohashes are still supported from an API perspective, and can still be accessed using the .geohash field extension, but they are no longer used to index geo point data.
For example, in previous versions of Elasticsearch, the mapping of a geo_point field could look like the following:

"location": {
  "type": "geo_point",
  "lat_lon": true,
  "geohash": true,
  "geohash_prefix": true,
  "geohash_precision": "1m"
}
But, starting from Elasticsearch version 5.0, you can only create the mapping of a geo_point field as shown in the following:

"location": {
  "type": "geo_point"
}
Some more changes
The following are some very important additional changes you should be aware of:
- Removal of site plugins. The support of site plugins has been completely removed from Elasticsearch 5.0.
- Node clients are completely removed from Elasticsearch as they are considered really bad from a security perspective.
- Every Elasticsearch node, by default, binds to localhost, and if you change the bind address to some non-localhost IP address, Elasticsearch considers the node production-ready and applies various bootstrap checks when the node starts. This is done to prevent your cluster from blowing up in the future because you forgot to allocate enough resources to Elasticsearch. The following are some of the bootstrap checks Elasticsearch applies: the maximum number of file descriptors check, the maximum map count check, and the heap size check. Please go to the following URL to ensure that you have set all the parameters needed for the bootstrap checks to pass: https://www.elastic.co/guide/en/elasticsearch/reference/master/bootstrap-checks.html.
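On a typical Linux system, satisfying these bootstrap checks usually comes down to a few settings. The following is a sketch using commonly recommended values (check the bootstrap checks documentation for the authoritative list; the heap size here is only an example):

```
# /etc/security/limits.conf: raise the open file descriptor limit
elasticsearch  -  nofile  65536

# /etc/sysctl.conf: raise the maximum number of memory map areas
vm.max_map_count = 262144

# config/jvm.options: set min and max heap to the same explicit value
-Xms2g
-Xmx2g
```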
Note
Please note that if you are using OpenVZ virtualization on your servers, you may find it difficult to set the maximum map count required for running Elasticsearch in production mode, as this virtualization does not easily allow you to edit kernel parameters. You should either ask your sysadmin to configure vm.max_map_count correctly, or move to a platform where you can set it, for example a KVM VPS.

- The _optimize endpoint, which was deprecated in 2.x, has finally been removed and replaced by the Force Merge API. For example, an optimize request in version 1.x...
curl -XPOST 'http://localhost:9200/test/_optimize?max_num_segments=5'
...should be converted to:
curl -XPOST 'http://localhost:9200/test/_forcemerge?max_num_segments=5'
In addition to these changes, some major changes have been done in search, settings, allocation, merge, and scripting modules, along with cat and Java APIs, which we will cover in subsequent chapters.