Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
ElasticSearch Cookbook

You're reading from   ElasticSearch Cookbook As a user of ElasticSearch in your web applications you'll already know what a powerful technology it is, and with this book you can take it to new heights with a whole range of enhanced solutions from plugins to scripting.

Arrow left icon
Product type Paperback
Published in Dec 2013
Publisher Packt
ISBN-13 9781782166627
Length 422 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Alberto Paro Alberto Paro
Author Profile Icon Alberto Paro
Alberto Paro
Arrow right icon
View More author details
Toc

Table of Contents (14) Chapters Close

Preface 1. Getting Started FREE CHAPTER 2. Downloading and Setting Up ElasticSearch 3. Managing Mapping 4. Standard Operations 5. Search, Queries, and Filters 6. Facets 7. Scripting 8. Rivers 9. Cluster and Nodes Monitoring 10. Java Integration 11. Python Integration 12. Plugin Development Index

Managing your data

Unless you are using ElasticSearch as a search engine or a distributed data store, it's important to understand concepts on how ElasticSearch stores and manages your data.

Getting ready

To work with ElasticSearch data, a user must know basic concepts of data management and JSON that is the "lingua franca" for working with ElasticSearch data and services.

How it works...

Our main data container is called index (plural indices) and it can be considered as a database in the traditional SQL world. In an index, the data is grouped in data types called mappings in ElasticSearch. A mapping describes how the records are composed (called fields).

Every record, that must be stored in ElasticSearch, must be a JSON object.

Natively, ElasticSearch is a schema-less datastore. When you put records in it, during insert it processes the records, splits them into fields, and updates the schema to manage the inserted data.

To manage huge volumes of records, ElasticSearch uses the common approach to split an index into many shards so that they can be spread on several nodes. The shard management is transparent in usage—all the common record operations are managed automatically in the ElasticSearch application layer.

Every record is stored in only one shard. The sharding algorithm is based on record ID, so many operations that require loading and changing of records can be achieved without hitting all the shards.

The following schema compares ElasticSearch structure with SQL and MongoDB ones:

ElasticSearch

SQL

MongoDB

Index (Indices)

Database

Database

Shard

Shard

Shard

Mapping/Type

Table

Collection

Field

Field

Field

Record (JSON object)

Record (Tuples)

Record (BSON object)

There's more...

ElasticSearch, internally, has rigid rules about how to execute operations to ensure safe operations on index/mapping/records. In ElasticSearch, the operations are divided as follows:

  • Cluster operations: At cluster level all write ones are locked, first they are applied to the master node and then to the secondary one. The read operations are typically broadcasted.
  • Index management operations: These operations follow the cluster pattern.
  • Record operations: These operations are executed on single documents at shard level.

When a record is saved in ElasticSearch, the destination shard is chosen based on the following factors:

  • The ID (unique identifier) of the record. If the ID is missing, it is autogenerated by ElasticSearch.
  • If the routing or parent (covered while learning the parent/child mapping) parameters are defined, the correct shard is chosen by the hash of these parameters.

Splitting an index into shards allows you to store your data in different nodes, because ElasticSearch tries to do shard balancing.

Every shard can contain up to 2^32 records (about 4.2 billion records), so the real limit to shard size is its storage size.

Shards contain your data and during search process all the shards are used to calculate and retrieve results. ElasticSearch performance in big data scales horizontally with the number of shards.

All native records operations (such as index, search, update, and delete) are managed in shards.

The shard management is completely transparent to the user. Only an advanced user tends to change the default shard routing and management to cover their custom scenarios. A common custom scenario is the requirement to put customer data in the same shard to speed up his/her operations (search/index/analytics).

Best practice

It's best practice not to have a too big shard (over 10 GB) to avoid poor performance in indexing due to continuous merge and resizing of index segments.

It's not good to oversize the number of shards to avoid poor search performance due to native distributed search (it works as MapReduce). Having a huge number of empty shards in an index consumes only memory.

You have been reading a chapter from
ElasticSearch Cookbook
Published in: Dec 2013
Publisher: Packt
ISBN-13: 9781782166627
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime