Elasticsearch 8.x Cookbook

Chapter 2: Managing Mappings

Mapping is a primary concept in Elasticsearch that defines how the search engine should process a document and its fields to be effectively used in search and aggregations.

Search engines perform the following two main operations:

Indexing: This action is used to receive a document, process it, and store it in an index.
Searching: This action is used to retrieve the data from the index based on a query.

These two operations are strictly connected; an error in the indexing step leads to unwanted or missing search results.

Elasticsearch, by default, has explicit mapping at the index level. When indexing, if a mapping is not provided, a default one is created by guessing the structure from the JSON data fields that the document is composed of. This new mapping is then automatically propagated to all the cluster nodes: it will become part of the cluster's state.

The default type mapping has sensible default values, but when you want to change their behavior or customize several other aspects of indexing (from object to special fields, storing, ignoring, completion, and so on), you need to provide a new mapping definition.

In this chapter, we'll look at all the possible mapping field types that document mappings are composed of.

In this chapter, we will cover the following recipes:

Using explicit mapping creation
Mapping base types
Mapping arrays
Mapping an object
Mapping a document
Using dynamic templates in document mapping
Managing nested objects
Managing a child document with a join field
Adding a field with multiple mappings
Mapping a GeoPoint field
Mapping a GeoShape field
Mapping an IP field
Mapping an Alias field
Mapping a Percolator field
Mapping the Rank Feature and Feature Vector fields
Mapping the Search as you type field
Using the Range Field type
Using the Flattened field type
Using the Point and Shape field types
Using the Dense Vector field type
Using the Histogram field type
Adding metadata to a mapping
Specifying different analyzers
Using index components and templates

Using explicit mapping creation

If we consider the index as a database in the SQL world, mapping is similar to the create table definition.

Elasticsearch can understand the structure of the document that you are indexing (reflection) and create the mapping definition automatically. This book calls this explicit mapping creation; the Elasticsearch documentation refers to the same feature as dynamic mapping.

Getting ready

To execute the code in this recipe, you will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute these commands, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To understand the examples and code in this recipe, basic knowledge of JSON is required.

How to do it…

You can explicitly create a mapping by adding a new document to Elasticsearch. For this, perform the following steps:

Create an index, as shown in the following code:
```
PUT test
```

The output will be as follows:

{ "acknowledged" : true, "shards_acknowledged" : true,
 "index" : "test" }

Put a document in the index, as shown in the following code:
```
PUT test/_doc/1
{"name":"Paul", "age":35}
```

The output will be as follows:

{
  "_index" : "test", "_id" : "1", "_version" : 1,
  "result" : "created",
  "_shards" : {"total" : 2, "successful" : 1, "failed" : 0 },
  "_seq_no" : 0,  "_primary_term" : 1
}

Get the mapping with the following code:
```
GET test/_mapping
```

The mapping that's auto-created by Elasticsearch should look as follows:

{
  "test" : {
    "mappings" : {
      "properties" : {
        "age" : { "type" : "long" },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {"type" : "keyword", "ignore_above" : 256 }
} } } } } }

To delete the index, you can use the following command:
```
DELETE test
```

The output will be as follows:

{ "acknowledged" : true }

How it works…

The first command line (Step 1) creates an index where we can configure the mappings in the future, if required, and store documents in it.

The second command (Step 2) inserts a document in the index (we'll learn how to create the index in the Creating an index recipe of Chapter 3, Basic Operations, and record indexing in the Indexing a document recipe of Chapter 3, Basic Operations).

Elasticsearch reads all the default properties for the field of the mapping and starts to process them as follows:

If the field is already present in the mapping and the value of the field is valid (it matches the correct type), Elasticsearch does not need to change the current mappings.
If the field is already present in the mapping but the value of the field is of a different type, it tries to upgrade the field type (for example, from integer to long). If the types are not compatible, it throws an exception, and the indexing process fails.
If the field is not present, it tries to auto-detect the type of field. It updates the mappings with a new field mapping. (In the case of a null value, it skips the mapping update until it encounters a concrete type.)
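The auto-detection step described above can be pictured with a short Python sketch. It is only a toy model of these rules, not Elasticsearch's actual implementation: the function names infer_field and infer_properties are hypothetical, and date and numeric detection on strings are deliberately left out.

```python
def infer_field(value):
    """Return a toy field mapping for a JSON value, or None to skip it."""
    if value is None:
        return None                      # null: wait for a concrete value
    if isinstance(value, bool):          # test bool first: bool is an int in Python
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        return {"type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}}
    if isinstance(value, list):
        # an array takes the type of its first non-null element
        for item in value:
            if item is not None:
                return infer_field(item)
        return None
    if isinstance(value, dict):
        return {"properties": infer_properties(value)}
    raise TypeError(f"unsupported JSON value: {value!r}")


def infer_properties(doc):
    """Build the "properties" section for every field with a concrete value."""
    props = {}
    for name, value in doc.items():
        field = infer_field(value)
        if field is not None:
            props[name] = field
    return props


print(infer_properties({"name": "Paul", "age": 35}))
```

For the document used in this recipe, the sketch produces the same two field definitions that GET test/_mapping returned: a long for age and a text field with a keyword subfield for name.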

There's more…

In Elasticsearch, every document has an identifier, called an ID, that is unique within a single index and is stored in the special _id field of the document.

The _id field can be provided at index time or can be assigned automatically by Elasticsearch if it is missing.

When a mapping type is created or changed, Elasticsearch automatically propagates mapping changes to all the nodes in the cluster so that all the shards are aligned to process that particular type.

In Elasticsearch 7.x, there was a default type (_doc): it was removed in Elasticsearch 8.x and above.

Mapping base types

Using explicit mapping makes it possible to quickly start ingesting data using a schemaless approach without being concerned about field types. However, to achieve better results and performance in indexing, you need to manually define a mapping.

Fine-tuning mapping brings some advantages, such as the following:

Reducing the index size on the disk (disabling functionalities for custom fields)
Indexing only interesting fields (general speed up)
Precooking data for fast search or real-time analytics (such as aggregations)
Correctly defining whether a field must be analyzed in multiple tokens or considered as a single token
Defining mapping types such as geo point, suggester, vectors, and so on

Elasticsearch allows you to use base fields with a wide range of configurations.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To execute this recipe's examples, you will need to create an index with a test name, where you can put mappings, as explained in the Using explicit mapping creation recipe.

How to do it...

Let's use a semi real-world example of a shop order for our eBay-like shop:

First, we must define an order:

Figure 2.1 – Example of an order

Our order record must be converted into an Elasticsearch mapping definition, as follows:

PUT test/_mapping
{  "properties" : {
      "id" : {"type" : "keyword"},
      "date" : {"type" : "date"},
      "customer_id" : {"type" : "keyword"},
      "sent" : {"type" : "boolean"},
      "name" : {"type" : "keyword"},
      "quantity" : {"type" : "integer"},
      "price" : {"type" : "double"},
      "vat" : {"type" : "double", "index": false}
} }

Now, the mapping is ready to be put in the index. We will learn how to do this in the Putting a mapping in an index recipe of Chapter 3, Basic Operations.

How it works...

Field types must be mapped to one of the Elasticsearch base types, and options on how the field must be indexed need to be added.

The following table is a reference for the mapping types:

Figure 2.2 – Base type mapping

Depending on the data type, it's possible to give explicit directives to Elasticsearch when you're processing the field for better management. The most used options are as follows:

store (default false): This marks the field to be stored in a separate index fragment for fast retrieval. Storing a field consumes disk space but reduces computation if you need to extract it from a document (for example, in scripting and aggregations). The possible values for this option are true and false. Stored fields are always returned as an array of values for consistency.

The stored fields are faster than others in aggregations.

index: This defines whether or not the field should be indexed. The possible values for this parameter are true and false (the default is true). Fields that are not indexed are not searchable.
null_value: This defines a default value if the field is null.
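boost: This is used to change the importance of a field (the default is 1.0). Note that the boost parameter on field mappings was removed in Elasticsearch 8.0, where boosting is applied at query time instead.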

boost works on a term level only, so it's mainly used in term, terms, and match queries.

search_analyzer: This defines an analyzer to be used during the search. If it's not defined, the analyzer of the parent object is used (the default is null).
analyzer: This sets the default analyzer to be used (the default is null).
norms: This controls the Lucene norms. This parameter is used to score queries better. If the field is only used for filtering, it's a best practice to disable it to reduce resource usage (the default is true for text fields and false for keyword fields).
copy_to: This allows you to copy the content of a field to another one to achieve functionalities, similar to the _all field.
ignore_above: This allows you to skip indexing a string if it is longer than the given number of characters. This is useful for processing fields for exact filtering, aggregations, and sorting. It also prevents a single term token from becoming too big and prevents errors due to the Lucene term's byte-length limit of 32,766. The maximum suggested value is 8191 (https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html).
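The suggested maximum of 8191 for ignore_above follows from simple arithmetic: ignore_above counts characters, whereas the Lucene limit counts bytes, and a UTF-8 character can take up to 4 bytes. The following Python sketch checks the numbers; the constant and function names are only illustrative.

```python
LUCENE_MAX_TERM_BYTES = 32766   # term byte-length limit quoted above
MAX_UTF8_BYTES_PER_CHAR = 4     # worst case for a single UTF-8 character

# Largest character count that is safe even if every character needs 4 bytes
safe_ignore_above = LUCENE_MAX_TERM_BYTES // MAX_UTF8_BYTES_PER_CHAR


def fits_in_term(text):
    """True if the UTF-8 encoding of text stays within the byte limit."""
    return len(text.encode("utf-8")) <= LUCENE_MAX_TERM_BYTES


worst_case = "\U0001F600" * safe_ignore_above   # a 4-byte character, repeated
print(safe_ignore_above, fits_in_term(worst_case))
```

If your keyword fields only contain ASCII text, each character takes one byte, so a larger value would still fit; 8191 is the conservative choice for arbitrary Unicode content.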

There's more...

From Elasticsearch version 6.x onward, as shown in the Using explicit mapping creation recipe, the explicit inferred type for a string is a multifield mapping:

The default processing is text. This mapping allows textual queries (such as term, match, and span queries). In the example provided in the Using explicit mapping creation recipe, this was name.
The keyword subfield is used for keyword mapping. This field can be used for exact term matching and aggregation and sorting. In the example provided in the Using explicit mapping creation recipe, the referred field was name.keyword.

Another important parameter, available only for text mapping, is term_vector (the vector of terms that compose a string). Please refer to the Lucene documentation for further details at https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/index/Terms.html.

term_vector can accept the following values:

no: This is the default value; that is, skip term vector.
yes: This stores the term vector.
with_offsets: This stores the term vector with token offsets (start and end positions in a block of characters).
with_positions: This is used to store the position of the token in the term vector.
with_positions_offsets: This stores all the term vector data.
with_positions_payloads: This is used to store the position and payloads of the token in the term vector.
with_positions_offsets_payloads: This stores all the term vector data with payloads.

Term vectors allow fast highlighting but consume disk space due to storing additional text information. It's a best practice to only activate it in fields that require highlighting, such as title or document content.
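To make the term_vector options more concrete, the following Python sketch shows the kind of information that a term vector with positions and offsets holds for a string. It is a toy model under simple assumptions (whitespace tokenization and lowercasing); toy_term_vector is a hypothetical helper, and real analyzers produce different tokens.

```python
import re
from collections import defaultdict


def toy_term_vector(text):
    """Map each lowercased term to a list of (position, start, end) entries."""
    vector = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\S+", text)):
        # position: token index; start/end: character offsets in the text
        vector[match.group().lower()].append((position, match.start(), match.end()))
    return dict(vector)


print(toy_term_vector("Quick brown quick"))
```

The offsets are what make highlighting fast: the highlighter can mark the matching characters directly instead of analyzing the text again.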

Mapping arrays

Array or multi-value fields are very common in data models (such as multiple phone numbers, addresses, names, aliases, and so on), but they're not natively supported in traditional SQL solutions.

In SQL, multi-value fields require you to create accessory tables that must be joined to gather all the values, leading to poor performance when the cardinality of the records is huge.

Elasticsearch, which works natively in JSON, provides support for multi-value fields transparently.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

How to do it…

To use an Array type in our mapping, perform the following steps:

Every field is automatically managed as an array. For example, to store tags for a document, the mapping would be as follows:

{  "properties" : {
      "name" : {"type" : "keyword"},
      "tag" : {"type" : "keyword", "store" : true},
      ...
} }

This mapping is valid for indexing both of the following documents. The following is the code for document1:
```
{"name": "document1", "tag": "awesome"}
```

The following is the code for document2:

{"name": "document2", "tag": ["cool", "awesome", "amazing"] }

How it works…

Elasticsearch transparently manages the array: there is no difference if you declare a single value or a multi-value due to its Lucene core nature.

Multi-values for fields are managed in Lucene, so you can add them to a document with the same field name. For people with a SQL background, this behavior may be quite strange, but this is a key point in the NoSQL world as it reduces the need for join queries and for separate tables to manage multi-values. An array of embedded objects has the same behavior as simple fields.
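From the client's point of view, this means that a field can always be treated as a list of values. The following Python sketch illustrates the idea with the two documents from this recipe; field_values is a hypothetical helper, not part of any Elasticsearch client.

```python
def field_values(doc, field):
    """Return the values of a field as a list, whether it holds one value or many."""
    value = doc.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


document1 = {"name": "document1", "tag": "awesome"}
document2 = {"name": "document2", "tag": ["cool", "awesome", "amazing"]}

# An exact lookup on "awesome" finds both documents, single-valued or not.
matches = [d["name"] for d in (document1, document2)
           if "awesome" in field_values(d, "tag")]
print(matches)
```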

Mapping an object

The object type is one of the most common field aggregation structures in document databases.

An object is a base structure (analogous to a record in SQL): in JSON types, they are defined as key/value pairs inside the {} symbols.

Elasticsearch extends the traditional use of objects (which are flat in DBMS), thus allowing for recursive embedded objects.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. Again, I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

We can rewrite the order mapping from the Mapping base types recipe so that the item fields are grouped in an embedded object:

PUT test/_mapping
{ "properties" : {
      "id" : {"type" : "keyword"},
      "date" : {"type" : "date"},
      "customer_id" : {"type" : "keyword", "store" : true},
      "sent" : {"type" : "boolean"},
      "item" : {
        "type" : "object",
        "properties" : {
          "name" : {"type" : "text"},
          "quantity" : {"type" : "integer"},
          "price" : {"type" : "double"},
          "vat" : {"type" : "double"}
} } } }

How it works…

Elasticsearch speaks native JSON, so every complex JSON structure can be mapped in it.

When Elasticsearch is parsing an object type, it tries to extract fields and processes them according to the defined mapping. If no mapping is defined, it learns the structure of the object using reflection.

The most important attributes of an object are as follows:

properties: This is a collection of fields or objects (we can consider them as columns in the SQL world).
enabled: This establishes whether or not the object should be processed. If it's set to false, the data contained in the object is not indexed and it cannot be searched (the default is true).
dynamic: This allows Elasticsearch to add new field names to the object using a reflection on the values of the inserted data. If it's set to false, when you try to index an object containing a new field, the new field is silently ignored: it is not indexed or added to the mapping, although it remains in the document's source. If it's set to strict, when a new field is present in the object, an error is raised and the document is not indexed. The dynamic parameter allows you to be safe about making changes to the document's structure (the default is true).
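The three behaviors of the dynamic parameter can be summarized with a small Python sketch. It is a toy model that only tracks which fields end up in the mapping and which are indexed; apply_dynamic and the "guessed" placeholder type are hypothetical names used for illustration.

```python
def apply_dynamic(properties, doc, dynamic=True):
    """Return (updated properties, indexed fields) for a toy object mapping."""
    updated = dict(properties)
    indexed = {}
    for name, value in doc.items():
        if name in updated:
            indexed[name] = value                  # known field: always indexed
        elif dynamic is True:
            updated[name] = {"type": "guessed"}    # stands in for type detection
            indexed[name] = value
        elif dynamic is False:
            continue                               # ignored: mapping left unchanged
        elif dynamic == "strict":
            raise ValueError(f"strict mapping: unknown field [{name}]")
        else:
            raise ValueError(f"unsupported dynamic value: {dynamic!r}")
    return updated, indexed


mapping = {"name": {"type": "text"}}
doc = {"name": "tshirt", "color": "red"}   # "color" is not in the mapping

print(apply_dynamic(mapping, doc, True)[0])    # mapping gains "color"
print(apply_dynamic(mapping, doc, False)[1])   # only "name" is indexed
```

With dynamic set to "strict", the same call raises an error instead of returning, which mirrors the rejected indexing request.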

The most used attribute is properties, which allows you to map the fields of the object in Elasticsearch fields.

Disabling the indexing part of the document reduces the index size; however, the data cannot be searched. In other words, you end up with a smaller file on disk, but there is a cost in terms of functionality.

Mapping a document

The document mapping is also referred to as the root object. This has special parameters that control its behavior, and they are mainly used internally to do special processing, such as routing or time-to-live of documents.

In this recipe, we'll look at these special fields and learn how to use them.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

How to do it…

We can extend the preceding order example by adding some of the special fields, like so:

PUT test/_mapping
{ "_source": { "enabled": true },
    "_routing": { "required": true },
    "_index": { "enabled": true },
    "properties": {} }

How it works…

Every special field has parameters and value options, such as the following:

_id: This allows you to index only the ID part of the document. All the ID queries will speed up using the ID value (by default, this is not indexed and not stored).
_index: This controls whether or not the index must be stored as part of the document. It can be enabled by setting the "enabled": true parameter (enabled=false is the default).
_source: This controls how the document's source is stored. Storing the source is very useful, but it's a storage overhead; so, if it is not required, it can be turned off (enabled=true is the default).
_routing: This defines the shard that will store the document. It supports additional parameters, such as required (true/false). This is used to force the presence of the routing value, raising an exception if it's not provided.

Controlling how to index and process a document is very important and allows you to resolve issues related to complex data types.

Every special field has parameters to set particular configurations, and some of their behaviors could change in different releases of Elasticsearch.

Using dynamic templates in document mapping

In the Using explicit mapping creation recipe, we saw how Elasticsearch can guess the field type using reflection. In this recipe, we'll see how we can help it improve its guessing capabilities via dynamic templates.

The dynamic template feature is very useful. For example, it may be useful in situations where you need to create several indices with similar types because it allows you to move the need to define mappings from coded initial routines to automatic index-document creation. Typical usage is to define types for Logstash log indices.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

How to do it…

We can extend the previous mapping by adding document-related settings, as follows:

PUT test/_mapping
{
    "dynamic_date_formats":["yyyy-MM-dd", "dd-MM-yyyy"],
    "date_detection": true,
    "numeric_detection": true,
    "dynamic_templates":[
      {"template1":{
        "match":"*",
        "match_mapping_type": "long",
        "mapping": {"type":"{dynamic_type}", "store": true}
      }}    ],
    "properties" : {...}
}

How it works…

The root object (document) controls the behavior of its fields and all its children object fields. In document mapping, we can define the following:

date_detection: This allows you to extract a date from a string (true is the default).
dynamic_date_formats: This is a list of valid date formats. This is used if date_detection is active.
numeric_detection: This enables you to convert strings into numbers, if possible (false is the default).
dynamic_templates: This is a list of templates that are used to change the explicit mapping inference. If one of these templates is matched, the rules that have been defined in it are used to build the final mapping.
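The effect of date_detection and numeric_detection on string values can be sketched in Python as follows. This is a simplified model, not the real parser: detect_string_type is a hypothetical function, the two strptime formats stand in for the yyyy-MM-dd and dd-MM-yyyy patterns of the example, and the order of the checks (numbers first, dates only for non-numeric strings) is an assumption of the sketch.

```python
from datetime import datetime

# Python equivalents of the Java-style patterns "yyyy-MM-dd" and "dd-MM-yyyy"
DATE_FORMATS = ["%Y-%m-%d", "%d-%m-%Y"]


def _parses(convert, text):
    """True if convert(text) succeeds without a ValueError."""
    try:
        convert(text)
        return True
    except ValueError:
        return False


def detect_string_type(text, date_detection=True, numeric_detection=False):
    """Guess a field type for a string value (simplified model)."""
    is_long = _parses(int, text)
    is_double = _parses(float, text)
    if numeric_detection and is_long:
        return "long"
    if numeric_detection and is_double:
        return "float"
    if date_detection and not (is_long or is_double):
        for fmt in DATE_FORMATS:
            if _parses(lambda s: datetime.strptime(s, fmt), text):
                return "date"
    return "text"


print(detect_string_type("2024-01-31"))                    # date
print(detect_string_type("42"))                            # text
print(detect_string_type("42", numeric_detection=True))    # long
```

With the defaults, a numeric string such as "42" stays a text field; it only becomes a number when numeric_detection is switched on.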

A dynamic template is composed of two parts: the matcher and the mapping.

To match a field to activate the template, you can use several types of matchers, such as the following:

match: This allows you to define a match on the field name. The expression is a standard GLOB pattern (http://en.wikipedia.org/wiki/Glob_(programming)).
unmatch: This allows you to define the expression to be used to exclude matches (optional).
match_mapping_type: This controls the types of the matched fields; for example, string, long, and so on (optional).
path_match: This allows you to match the dynamic template against the full dot notation of the field; for example, obj1.*.value (optional).
path_unmatch: This will do the opposite of path_match, excluding the matched fields (optional).
match_pattern: This allows you to switch the matchers to regex (regular expression); otherwise, the glob pattern match is used (optional).

The dynamic template mapping part is a standard one but can use special placeholders, such as the following:

{name}: This will be replaced with the actual dynamic field name.
{dynamic_type}: This will be replaced with the type of the matched field.

The order of the dynamic templates is very important; only the first one that is matched is executed. It is good practice to order the ones with more strict rules first, and then the others.
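The matching rules and the first-match-wins ordering can be sketched in Python. This is an approximation for illustration only: resolve_template and fill_placeholders are hypothetical helpers, the ids_as_keyword template is made up for the example, only match, unmatch, and match_mapping_type are handled, and fnmatch accepts a few more wildcards than the simple * pattern.

```python
from fnmatch import fnmatchcase


def fill_placeholders(mapping, field_name, detected_type):
    """Replace {name} and {dynamic_type} in the string values of a mapping."""
    filled = {}
    for key, value in mapping.items():
        if isinstance(value, str):
            value = value.replace("{name}", field_name)
            value = value.replace("{dynamic_type}", detected_type)
        filled[key] = value
    return filled


def resolve_template(templates, field_name, detected_type):
    """Return the mapping of the first template that matches, or None."""
    for template in templates:
        rule = next(iter(template.values()))       # each entry is {name: rule}
        if "match" in rule and not fnmatchcase(field_name, rule["match"]):
            continue
        if "unmatch" in rule and fnmatchcase(field_name, rule["unmatch"]):
            continue
        if rule.get("match_mapping_type", "*") not in ("*", detected_type):
            continue
        return fill_placeholders(rule["mapping"], field_name, detected_type)
    return None


templates = [
    # stricter rule first: a hypothetical template for ID-like fields
    {"ids_as_keyword": {"match": "*_id", "mapping": {"type": "keyword"}}},
    # the generic template from this recipe
    {"template1": {"match": "*", "match_mapping_type": "long",
                   "mapping": {"type": "{dynamic_type}", "store": True}}},
]

print(resolve_template(templates, "customer_id", "long"))
print(resolve_template(templates, "quantity", "long"))
```

If the two templates were swapped, customer_id would be caught by template1 and mapped as a stored long, which is why the stricter rules should come first.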

There's more...

Dynamic templates are very handy when you need to apply a mapping configuration to all the fields. This can be done by adding a dynamic template, similar to this one:

"dynamic_templates" : [
  { "store_generic" : {
      "match" : "*", "mapping" : { "store" : true }
} } ]

In this example, all the new fields, which will be added with explicit mapping, will be stored.

Managing nested objects

There is a special type of embedded object called a nested object. This resolves a problem related to Lucene's indexing architecture, in which all the fields of embedded objects are viewed as a single object (technically speaking, they are flattened). During the search, in Lucene, it is not possible to distinguish between values and different embedded objects in the same multi-valued array.

If we consider the previous order example, it's not possible to distinguish an item's name and its quantity with the same query since Lucene puts them in the same Lucene document object. We need to index them in different documents and then join them. This entire round trip is managed by nested objects and nested queries.
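The flattening problem is easier to see with data. The following Python sketch is a toy model, not Lucene code: it contrasts a search over flattened multi-valued fields with a search that checks each item on its own, which is what nested objects make possible. The sample items and the function names are made up for the example.

```python
def flatten(items):
    """Merge an array of objects into multi-valued fields, as the object type does."""
    flat = {}
    for item in items:
        for key, value in item.items():
            flat.setdefault(f"item.{key}", []).append(value)
    return flat


def match_flattened(items, name, quantity):
    """Both conditions are tested against the merged values of all the items."""
    flat = flatten(items)
    return (name in flat.get("item.name", [])
            and quantity in flat.get("item.quantity", []))


def match_nested(items, name, quantity):
    """Each item is tested on its own, like a separate hidden document."""
    return any(i.get("name") == name and i.get("quantity") == quantity
               for i in items)


items = [{"name": "tshirt", "quantity": 10}, {"name": "mug", "quantity": 1}]

print(flatten(items))
print(match_flattened(items, "mug", 10))   # True: a false positive
print(match_nested(items, "mug", 10))      # False
```

The order contains no item that is both a mug and has a quantity of 10, yet the flattened search matches because the name and the quantity come from different items.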

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

A nested object is defined as a standard object with the nested type.

Regarding the example in the Mapping an object recipe, we can change the type from object to nested, as follows:

PUT test/_mapping
{ "properties" : {
      "id" : {"type" : "keyword"},
      "date" : {"type" : "date"},
      "customer_id" : {"type" : "keyword"},
      "sent" : {"type" : "boolean"},
      "item" : {"type" : "nested",
        "properties" : {
            "name" : {"type" : "keyword"},
            "quantity" : {"type" : "long"},
            "price" : {"type" : "double"},
            "vat" : {"type" : "double"}
} } } }

How it works…

When a document is indexed, if an embedded object has been marked as nested, it's extracted from the original document, indexed as a new separate document, and saved in a special index position near the parent document.

In the preceding example, we reused the mapping from the Mapping an object recipe, but we changed the type of the item from object to nested. No other action must be taken to convert an embedded object into a nested one.

Nested objects are special Lucene documents that are saved in the same block of data as their parent; this approach allows for fast joining with the parent document.

Nested objects are not searchable with standard queries, only with nested ones. They are not shown in standard query results.

The lives of nested objects are related to their parents: deleting/updating a parent automatically deletes/updates all the nested children. Changing the parent means Elasticsearch will do the following:

Mark old documents as deleted.
Mark all nested documents as deleted.
Index the new document version.
Index all nested documents.

There's more...

Sometimes, you must propagate information about the nested objects to their parent or root objects. This is mainly to build simpler queries about the parents (such as terms queries without using nested ones). To achieve this, two special properties of nested objects must be used:

include_in_parent: This makes it possible to automatically add the nested fields to the immediate parent.
include_in_root: This adds the nested object fields to the root object.

These settings add data redundancy, but they reduce the complexity of some queries, thus improving performance.

Managing a child document with a join field

In the previous recipe, we saw how it's possible to manage relationships between objects with the nested object type. The disadvantage of nested objects is their dependence on their parents. If you need to change the value of a nested object, you need to reindex the parent (this causes a potential performance overhead if the nested objects change too quickly). To solve this problem, Elasticsearch allows you to define child documents.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

How to do it…

In the following example, we have two related objects: an Order and an Item.

Their UML representation is as follows:

Figure 2.3 – UML example of an Order/Item relationship

The final mapping should merge the field definitions of both Order and Item, as well as use a special field (join_field, in this example) that takes the parent/child relationship.

To use join_field, follow these steps:

First, we must define the mapping, as follows:

PUT test1/_mapping
{ "properties": {
    "join_field": {
      "type": "join", "relations": { "order": "item" }
    },
    "id": { "type": "keyword" },
    "date": { "type": "date" },
    "customer_id": { "type": "keyword" },
    "sent": { "type": "boolean" },
    "name": { "type": "text" },
    "quantity": { "type": "integer" },
    "vat": { "type": "double" }
} }

The preceding mapping is very similar to the one in the previous recipe.

If we want to store the joined records, we will need to save the parent first and then the children, like so:

PUT test1/_doc/1?refresh
{ "id": "1", "date": "2018-11-16T20:07:45Z", "customer_id": "100", "sent": true, "join_field": "order" }
PUT test1/_doc/c1?routing=1&refresh
{ "name": "tshirt", "quantity": 10, "price": 4.3, "vat": 8.5,
  "join_field": { "name": "item", "parent": "1" } }

The child item requires special management because we need to add routing with the parent ID (1 in the preceding example). Furthermore, we need to specify the relation name of the child (item) and the ID of its parent in the join_field object.

How it works…

When multiple related entities live in the same index, the mapping must be computed as the sum of the mapping fields of all of them.

The relationship between objects must be defined in join_field.

There can only be a single join field per index mapping; if you need to provide several relationships, you can define them all in its relations object.

The child document must be indexed in the same shard as the parent; so, when indexed, an extra parameter must be passed, which is routing (we'll learn how to do this in the Indexing a document recipe in Chapter 3, Basic Operations).
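The reason why the routing value places the child next to its parent can be sketched in Python. By default, a document is routed by its own ID, and the shard is derived from a hash of the routing value. The sketch below is a simplification under stated assumptions: CRC32 stands in for the real hash function, the formula ignores the details of the actual implementation, and shard_for and the number of shards are hypothetical.

```python
import zlib


def shard_for(routing, number_of_shards):
    """Pick a shard from a routing value (CRC32 stands in for the real hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_shards


NUMBER_OF_SHARDS = 5   # hypothetical index setting

parent_shard = shard_for("1", NUMBER_OF_SHARDS)     # parent 1, routed by its own ID
child_shard = shard_for("1", NUMBER_OF_SHARDS)      # child c1, indexed with routing=1
unrouted_shard = shard_for("c1", NUMBER_OF_SHARDS)  # child c1, routed by its own ID

print(parent_shard == child_shard)   # True: same routing value, same shard
print(unrouted_shard)                # may or may not be the parent's shard
```

Without the routing parameter, the child would be placed according to its own ID, so nothing would guarantee that it ends up in the same shard as its parent.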

A child document doesn't need to reindex the parent document when we want to change its values. Consequently, it's fast in terms of indexing, reindexing (updating), and deleting.

There's more...

In Elasticsearch, we have different ways to manage relationships between objects, as follows:

Embedding with type=object: This is implicitly managed by Elasticsearch and it considers the embedding as part of the main document. It's fast, but you need to reindex the main document to change the value of the embedded object.
Nesting with type=nested: This allows you to accurately search and filter the parent by using nested queries on children. Everything works for the embedded object except for the query (you must use a nested query to search for them).
External children documents: Here, the children are the external document, with a join_field property to bind them to the parent. They must be indexed in the same shard as the parent. The join with the parent is a bit slower than the nested one. This is because the nested objects are in the same data block as the parent in the Lucene index and they are loaded with the parent, whereas the child documents require more read operations.

Choosing how to model the relationship between objects depends on your application scenario.

Tip

There is also another approach that can be used, although it performs poorly on large datasets: decoupling the join relationship. You do the join query in two steps: first, collect the IDs of the children/other documents and then search for them in a field of their parent.
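The two steps of a decoupled join can be sketched in Python with in-memory lists that stand in for two sets of documents. Everything here is hypothetical sample data: the item_ids reference field, the documents, and the orders_with_item function are made up for the example, and in a real application each step would be a search request.

```python
# Parents hold a field with the IDs of their related documents.
orders = [
    {"id": "1", "customer_id": "100", "item_ids": ["c1", "c2"]},
    {"id": "2", "customer_id": "200", "item_ids": ["c3"]},
]
items = [
    {"id": "c1", "name": "tshirt"},
    {"id": "c2", "name": "mug"},
    {"id": "c3", "name": "mug"},
]


def orders_with_item(name):
    """Return the IDs of the orders that reference an item with the given name."""
    # Step 1: collect the IDs of the matching items.
    matching_ids = {item["id"] for item in items if item["name"] == name}
    # Step 2: search the parents whose reference field contains one of those IDs.
    return [order["id"] for order in orders
            if matching_ids.intersection(order["item_ids"])]


print(orders_with_item("tshirt"))   # ['1']
print(orders_with_item("mug"))      # ['1', '2']
```

The cost grows with the number of IDs collected in the first step, since they all have to be sent back in the second query, which is why this approach degrades on large datasets.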

Elasticsearch 8.x Cookbook: Over 180 recipes to perform fast, scalable, and reliable searches for your enterprise, Fifth Edition
