You're reading from Elasticsearch 7.0 Cookbook Over 100 recipes for fast, scalable, and reliable search for your enterprise

Product type Paperback

Published in Apr 2019

Publisher Packt

ISBN-13 9781789956504

Length 724 pages

Edition 4th Edition

Languages

Java

Tools

Elasticsearch

Concepts

Enterprise Search

Author (1):

Alberto Paro


Setting up an ingestion node

The main goals of Elasticsearch are indexing, searching, and analytics, but it's often required to modify or enhance the documents before storing them in Elasticsearch.

The following are the most common scenarios in this case:

Preprocessing the log string to extract meaningful data
Enriching the content of textual fields with Natural Language Processing (NLP) tools
Enriching the content using machine learning (ML) computed fields
Adding data modification or transformation during ingestion, such as the following:
- Converting IP addresses to geolocation information
- Adding datetime fields at ingestion time
- Building custom fields (via scripting) at ingestion time

Getting ready

You need a working Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe, as well as a simple text editor to change configuration files.

How to do it…

To set up an ingest node, edit the config/elasticsearch.yml file and set the node.ingest property to true, as follows:

node.ingest: true

Every time you change your elasticsearch.yml file, a node restart is required.

How it works…

By default, every Elasticsearch node is configured as an ingest node (refer to Chapter 12, Using the Ingest module, for more information on the ingestion pipeline).

Like the coordinator node, the ingest node is a way to add functionality to Elasticsearch without compromising cluster safety.

If you want to prevent a node from being used for ingestion, you need to disable it with node.ingest: false. It's a best practice to disable ingestion on the master and data nodes, so that ingestion errors cannot affect them and the cluster stays protected. The coordinator node is the best candidate to be an ingest node.
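As a minimal sketch of this recommendation, the elasticsearch.yml of a data-only node might look like the following. It assumes the boolean role settings of Elasticsearch 7.0 (node.master, node.data, and node.ingest); the combination of values is illustrative and not taken from the recipe itself.

```yaml
# config/elasticsearch.yml on a data-only node (illustrative values)
node.master: false   # not master-eligible
node.data: true      # holds shards and serves indexing and search
node.ingest: false   # never executes ingest pipelines
```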

If you are using NLP, attachment extraction (via the attachment ingest plugin), or log ingestion, the best practice is to have a pool of coordinator nodes (no master, no data) with ingestion active.
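A node in such a pool could be configured along these lines. This is only a sketch under the assumption that the Elasticsearch 7.0 boolean role settings are used; the comments describe the intent, and the values are illustrative rather than prescribed by the recipe.

```yaml
# config/elasticsearch.yml on a dedicated ingest node (illustrative values)
node.master: false   # not master-eligible
node.data: false     # holds no shards
node.ingest: true    # executes ingest pipelines before documents are indexed
```

Because such a node holds no data and cannot be elected master, a crash or CPU saturation caused by a heavy pipeline stays confined to the ingest pool.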

In previous versions of Elasticsearch, the attachment and NLP plugins ran on the standard data or master nodes. This caused a lot of problems for Elasticsearch, for the following reasons:

High CPU usage by NLP algorithms, which saturates all the CPUs on the data node and degrades indexing and search performance
Instability due to badly formatted attachments and/or bugs in Apache Tika (the library used for document extraction)
NLP or ML algorithms that require a lot of CPU or stress the Java garbage collector, decreasing the performance of the node

The best practice is to have a pool of coordinator nodes with ingestion enabled to provide the best safety for the cluster and ingestion pipeline.

There's more…

Now that you know the four kinds of Elasticsearch nodes, you can easily understand that a robust architecture designed to work with Elasticsearch should be similar to this one: