ElasticSearch Cookbook - Second Edition

Managing your data

If you are going to use ElasticSearch as a search engine or a distributed data store, it's important to understand how ElasticSearch stores and manages your data.

Getting ready

To work with ElasticSearch data, you need a basic understanding of data management and of the JSON data format, which is the lingua franca for working with ElasticSearch data and services.

How it works...

The main data container is called an index (plural: indices), and it can be considered the equivalent of a database in the traditional SQL world. Within an index, data is grouped into data types, which are called mappings in ElasticSearch. A mapping describes how the records are composed (that is, which fields they contain).

Every record stored in ElasticSearch must be a JSON object.

Natively, ElasticSearch is a schema-less data store: when you insert a record, it processes the record, splits it into fields, and updates the schema (mapping) to accommodate the inserted data.
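
For example, the following sketch indexes a JSON object without defining any mapping first and then retrieves the mapping that ElasticSearch generated from the document's fields. It assumes a local node reachable on http://localhost:9200 and uses an illustrative myindex index with an order type:

    import json
    import requests

    # An illustrative JSON record; no mapping has been defined beforehand.
    doc = {"customer": "ACME", "total": 99.5, "date": "2015-01-01"}

    # Index the record: ElasticSearch derives the fields and their types.
    requests.put("http://localhost:9200/myindex/order/1", data=json.dumps(doc))

    # Retrieve the mappings that were generated dynamically for the index.
    mapping = requests.get("http://localhost:9200/myindex/_mapping").json()
    print(json.dumps(mapping, indent=2))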

To manage huge volumes of records, ElasticSearch uses the common approach of splitting an index into multiple shards so that they can be spread across several nodes. Shard management is transparent to the user; all common record operations are handled automatically in the ElasticSearch application layer.

Every record is stored in only one shard; the sharding algorithm is based on the record ID, so many operations that require loading or changing a record/object can be performed without hitting all the shards, but only the shard (and its replicas) that contains your object.
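
Conceptually, the default shard selection boils down to hashing the record ID and taking the result modulo the number of primary shards. The following is a simplified, purely illustrative sketch (ElasticSearch uses its own hash function internally):

    # Simplified sketch of the default routing: the destination shard is a
    # hash of the record ID modulo the number of primary shards.
    # (Illustrative only: ElasticSearch uses its own hash function.)
    def pick_shard(record_id, number_of_primary_shards):
        return hash(record_id) % number_of_primary_shards

    # The same ID always maps to the same shard.
    print(pick_shard("order-1", 5))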

The following table compares the ElasticSearch structure with the SQL and MongoDB ones:

ElasticSearch           SQL                MongoDB
Index (Indices)         Database           Database
Shard                   Shard              Shard
Mapping/Type            Table              Collection
Field                   Field              Field
Object (JSON Object)    Record (Tuples)    Record (BSON Object)

There's more...

To ensure safe operations on indices, mappings, and objects, ElasticSearch internally follows strict rules about how operations are executed.

In ElasticSearch, the operations are divided into:

  • Cluster/index operations: All writes on a cluster/index are locked while they are applied, first on the master node and then on the secondary nodes. Read operations are typically broadcast to all the nodes.
  • Document operations: Write actions are locked only on the single shard that is hit. Read operations are balanced across all the shard replicas.

When a record is saved in ElasticSearch, the destination shard is chosen based on:

  • The id (unique identifier) of the record; if the id is missing, it is autogenerated by ElasticSearch
  • If the routing or parent parameters are defined (we'll see them in the parent/child mapping), the destination shard is chosen by hashing these parameters instead (see the sketch after this list)
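
The following sketch shows these cases, assuming a local node on http://localhost:9200 and an illustrative myindex index with an order type; the routing value used here (the customer name) is also illustrative:

    import json
    import requests

    base = "http://localhost:9200/myindex/order"
    doc = {"customer": "ACME", "total": 99.5}

    # Explicit ID: the destination shard is derived from "1".
    requests.put(base + "/1", data=json.dumps(doc))

    # Explicit routing: the destination shard is derived from the routing
    # parameter (here the customer name) instead of the ID.
    requests.put(base + "/2", params={"routing": "ACME"}, data=json.dumps(doc))

    # No ID (POST): ElasticSearch autogenerates the ID.
    requests.post(base, data=json.dumps(doc))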

Splitting an index into shards allows you to store your data on different nodes, because ElasticSearch tries to balance the shard distribution across all the available nodes.

Every shard can contain up to 2^32 records (about 4.3 billion), so the real limit to a shard's size is its storage size.

Shards contain your data, and during the search process all the shards are used to calculate and retrieve results. ElasticSearch performance on big data therefore scales horizontally with the number of shards.

All native record operations (such as index, search, update, and delete) are managed in shards.

Shard management is completely transparent to the user; only advanced users tend to change the default shard routing and management to cover custom scenarios. A common custom scenario is the requirement to keep a customer's data in the same shard to speed up that customer's operations (search/index/analytics).
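
For example, if every document belonging to a customer is indexed with the routing parameter set to the customer's identifier, a search for that customer can pass the same routing value so that only the relevant shard is queried. A minimal sketch, assuming the same local node and the illustrative myindex index and routing value used above:

    import json
    import requests

    # Query only the shard that holds the documents routed with "ACME".
    query = {"query": {"match": {"customer": "ACME"}}}
    response = requests.post(
        "http://localhost:9200/myindex/_search",
        params={"routing": "ACME"},
        data=json.dumps(query),
    )
    print(response.json()["hits"]["total"])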

Best practices

It's best practice not to let a shard grow too big (over 10 GB), to avoid poor indexing performance due to the continuous merging and resizing of index segments.

It is also not good to over-allocate the number of shards, as this hurts search performance because of how the native distributed search works (it behaves like a map and reduce). A huge number of mostly empty shards in an index consumes memory and increases search times due to the overhead of the network and result-aggregation phases.
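
Because the number of primary shards is fixed when an index is created, it is worth setting it (and the number of replicas) explicitly instead of relying on the defaults. A minimal sketch, assuming a local node and the illustrative values below:

    import json
    import requests

    # Create an index with an explicit number of primary shards and replicas
    # (illustrative values; tune them to your data volume and node count).
    settings = {
        "settings": {
            "index": {
                "number_of_shards": 5,
                "number_of_replicas": 1,
            }
        }
    }
    requests.put("http://localhost:9200/myindex", data=json.dumps(settings))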
