Manual index creation and mappings configuration
We have our ElasticSearch cluster up and running, and we know how to use the ElasticSearch REST API to index, delete, and retrieve our data, although we still don't know the specifics. If you are used to SQL databases, you might know that before you can start putting data there, you need to create a structure that describes what your data looks like. Although ElasticSearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure, and thus defining it ourselves, is a better way. In the following few pages, you'll see how to create new indexes (and how to delete them) and how to create mappings that suit your needs and match your data structure.
Note
Please note that we didn't include all the information about the available types in this chapter and some features of ElasticSearch (such as nested type, parent-child handling, geographical points storing, and search) are described in the following chapters of this book.
Index
An index is a logical structure in ElasticSearch that holds your data. You can imagine it as a database table that has rows and columns. A row is a document we index and a column is a single field in the index. Your ElasticSearch cluster can have many indexes running inside it at the same time. But that's not all. Because a single index is made up of shards, it can be scattered across multiple nodes in a single cluster. In addition to that, each shard can have a replica, which is an exact copy of a shard; replicas are used to increase search performance as well as to duplicate data in case of a failure.
All the shards that an index is made up of are, in fact, Apache Lucene indexes, and the documents inside an index can be divided into types.
Types
In ElasticSearch, a single index can have multiple types of documents indexed—for example, you can store blog posts and blog users inside the same index, but with completely different structures using types.
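For example, the following commands (a sketch; the blog index name and the document contents are arbitrary, and by default ElasticSearch will create a missing index on the fly with its default settings) index a blog post and a blog user into the same index by using two different type names in the URL:

curl -XPUT 'http://localhost:9200/blog/post/1' -d '{
  "id": 1,
  "name": "Our first blog post"
}'
curl -XPUT 'http://localhost:9200/blog/user/1' -d '{
  "id": 1,
  "name": "John Doe"
}'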
Index manipulation
As we mentioned earlier, although ElasticSearch can do some operations for us, we would like to create the index ourselves. For the purpose of this chapter, we'll use an index named posts to index the blog posts from our blogging platform. Without any more hesitation, we will send the following command to create an index:
curl -XPOST 'http://localhost:9200/posts'
We just told the ElasticSearch instance installed on our local machine that we want to create the posts index. If everything goes right, you should see the following response from ElasticSearch:
{"ok":true,"acknowledged":true}
But there is a problem; we forgot to provide the mappings, which are responsible for describing the index structure. What can we do? Because we have no data at all, we'll go for the simplest approach: we will just delete the index. To do that, we run a command similar to the preceding one, but instead of using the POST HTTP method, we use DELETE. So the actual command is as follows:
curl -XDELETE 'http://localhost:9200/posts'
And the response is very similar to what we got earlier:
{"ok":true,"acknowledged":true}
So now that we know what an index is, how to create it, and how to delete it, let's define the index structure.
Schema mapping
The schema mapping, or mappings in short, is used to define the index structure. As you recall, each index can have multiple types, but we will concentrate on a single type for now. We want to index blog posts that can have the following structure:
Unique identifier
Name
Publication date
Contents
So far, so good, right? We decided that we want to store our posts in the posts index, so we'll define the post type to do that. In ElasticSearch, mappings are sent as JSON objects in a file. So, let's create a mappings file that matches the previously mentioned needs; we will call it posts.json. Its contents are as follows:
{ "mappings": { "post": { "properties": { "id": {"type":"long", "store":"yes", "precision_step":"0" }, "name": {"type":"string", "store":"yes", "index":"analyzed" }, "published": {"type":"date", "store":"yes", "precision_step":"0" }, "contents": {"type":"string", "store":"no", "index":"analyzed" } } } } }
And now, to create our posts index with the preceding file, we need to run the following command:
curl -XPOST 'http://localhost:9200/posts' -d @posts.json
The @posts.json part allows us to tell the cURL command that we want to send the contents of the posts.json file.
Note
Please note that you can store your mappings in a file named however you want.
And again, if everything goes well, we see the following response:
{"ok":true,"acknowledged":true}
We have our index structure and we can index our data, but we will take a pause now; we don't really know what the contents of the posts.json file mean. So let's discuss some details about this file.
Type definition
As you can see, the contents of the posts.json file form a JSON object and, because of that, the file starts and ends with curly brackets (if you want to learn more about JSON, please visit http://www.json.org/). All the type definitions inside the mentioned file are nested in the mappings object. Multiple types can be defined inside the mappings JSON object. In our example, we have a single post type; but if you would also like to include, for example, the user type, the file would look as follows:
{ "mappings": { "post": { "properties": { "id": { "type":"long", "store":"yes", "precision_step":"0" }, "name": { "type":"string", "store":"yes", "index":"analyzed" }, "published": { "type":"date", "store":"yes", "precision_step":"0" }, "contents": { "type":"string", "store":"no", "index":"analyzed" } } }, "user": { "properties": { "id": { "type":"long", "store":"yes", "precision_step":"0" }, "name": { "type":"string", "store":"yes", "index":"analyzed" } } } } }
You can see that each type is a JSON object and those are separated from each other by a comma character—like typical JSON structured data.
Fields
Each type is defined by a set of properties, that is, fields that are nested inside the properties object. So let's concentrate on a single field now, for example, the contents field, whose definition is as follows:
"contents": { "type":"string", "store":"yes", "index":"analyzed" }
It starts with the name of the field, which is contents in the preceding case. After the name of the field, we have an object describing the behavior of the field. The attributes are specific to the type of field we are using and we will discuss them in the next section. Of course, if you have multiple fields for a single type (which we usually have), remember to separate them with a comma character.
Core types
Each field can be given one of the core types provided by ElasticSearch. The core types in ElasticSearch are as follows:
String
Number
Date
Boolean
Binary
So now, let's discuss each of the core types available in ElasticSearch and the attributes that define their behavior.
Common attributes
Before continuing with the core type descriptions, I would like to discuss some common attributes that you can use with all the types (except for the binary one).
index_name: This is the name of the field that will be stored in the index. If this is not defined, the name will be set to the name of the object that the field is defined with. You'll usually omit this property.
index: This can take the values analyzed and no. For string-based fields, it can also be set to not_analyzed. If set to analyzed, the field will be indexed and thus searchable. If set to no, you won't be able to search such a field. The default value is analyzed. In the case of string-based fields, there is the additional option, not_analyzed, which means that the field will be indexed but not processed by the analyzer. So, it is written into the index as it was sent to ElasticSearch and only a perfect match will be counted during a search.
store: This can take the values yes and no, and it specifies whether the original value of the field should be written into the index. The default value is no, which means that you can't return that field in the results (although if you use the _source field, you can return the value even if it is not stored), but if the field is indexed you can still search on it.
boost: The default value of this attribute is 1. Basically, it defines how important the field is inside the document; the higher the boost, the more important the values in the field are.
null_value: This attribute specifies a value that should be written into the index if the field is not a part of an indexed document. The default behavior is to just omit that field.
include_in_all: This attribute specifies whether the field should be included in the _all field. By default, if the _all field is used, all the fields will be included in it. The _all field will be described in more detail in Chapter 3, Extending Your Structure and Search.
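To see how these attributes play together, the following is a sketch of a single field definition that combines a few of them (the unknown null value and the boost of 2.0 are arbitrary values chosen for illustration):

"name" : {
  "type" : "string",
  "store" : "yes",
  "index" : "analyzed",
  "null_value" : "unknown",
  "include_in_all" : true,
  "boost" : 2.0
}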
String
String is the most basic text type, which allows us to store one or more characters inside it. A sample definition of such a field can be as follows:
"contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
In addition to the common attributes, the following ones can also be set for string-based fields:
term_vector: This can take the value no (the default one), yes, with_offsets, with_positions, or with_positions_offsets. It defines whether the Lucene term vectors should be calculated for that field or not. If you are using highlighting, you will need to calculate term vectors.
omit_norms: This can take the value true or false. The default value is false. When this attribute is set to true, it disables the Lucene norms calculation for that field (and thus you can't use index-time boosting).
omit_term_freq_and_positions: This can take the value true or false. The default value is false. Set this attribute to true if you want to omit term frequency and position calculation during indexing. (Deprecated since ElasticSearch 0.20.)
index_options: This allows us to set indexing options. The possible values are docs, which results in indexing only the document numbers for terms; freqs, which results in indexing the document numbers and the term frequencies; and positions, which results in indexing the previously mentioned two plus the term positions. The default value is freqs. (Available since ElasticSearch 0.20.)
analyzer: This is the name of the analyzer used for indexing and searching. It defaults to the globally defined analyzer name.
index_analyzer: This is the name of the analyzer used for indexing.
search_analyzer: This is the name of the analyzer used for processing the part of the query string that is sent to that field.
ignore_above: This is the maximum length of the field value; characters beyond the specified limit will be ignored. This attribute is useful if we are only interested in the first N characters of the field.
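For example, a string field prepared for highlighting and limited to its leading characters could be defined as follows (this is only a sketch; the 255-character limit is an arbitrary choice):

"contents" : {
  "type" : "string",
  "store" : "yes",
  "index" : "analyzed",
  "term_vector" : "with_positions_offsets",
  "ignore_above" : 255
}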
Number
This is the core type that gathers all the numeric field types available to be used. The following types are available in ElasticSearch:
byte: A byte value; for example, 1
short: A short value; for example, 12
integer: An integer value; for example, 134
long: A long value; for example, 12345
float: A float value; for example, 12.23
double: A double value; for example, 12.23
A sample definition of a field based on one of the numeric types can be as follows:
"price" : { "type" : "float", "store" : "yes", "precision_step" : "4" }
In addition to the common attributes, the following ones can also be set for the numeric fields:
precision_step: This is the number of terms generated for each value in a field. The lower the value, the higher the number of terms generated, resulting in faster range queries (but a higher index size). The default value is 4.
ignore_malformed: This can take the value true or false. The default value is false. It should be set to true in order to omit badly formatted values.
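So, for example, if our price data comes from an external system and may contain badly formatted values, we could extend the earlier price field so that such values are skipped instead of causing indexing errors (a sketch):

"price" : {
  "type" : "float",
  "store" : "yes",
  "precision_step" : "4",
  "ignore_malformed" : true
}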
Date
This core type is designed to be used for date indexing. Dates follow a specific format, which can be changed, and are stored in UTC by default.
The default date format understood by ElasticSearch is quite universal and allows us to specify the date and, optionally, the time; for example, 2012-12-24T12:10:22. A sample definition of a field based on the date type can be as follows:
"published" : { "type" : "date", "store" : "yes", "format" : "YYYY-mm-dd" }
A sample document that uses the preceding field can be as follows:
{ "name" : "Sample document", "published" : "2012-12-22" }
In addition to the common attributes, the following ones can also be set for the date type-based fields:
format: This specifies the format of the date. The default value is dateOptionalTime. For a full list of formats, please visit http://www.elasticsearch.org/guide/reference/mapping/date-format.html.
precision_step: This specifies the number of terms generated for each value in that field. The lower the value, the higher the number of terms generated, resulting in faster range queries (but a higher index size). The default value is 4.
ignore_malformed: This can take the value true or false. The default value is false. It should be set to true in order to omit badly formatted values.
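For example, if our documents carry both a date and a time in a non-default notation, we could combine the format and ignore_malformed attributes as follows (the pattern shown is only one possible choice):

"published" : {
  "type" : "date",
  "store" : "yes",
  "format" : "yyyy-MM-dd HH:mm:ss",
  "ignore_malformed" : true
}

A document using this field would then contain a value such as "2012-12-22 10:30:00".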
Boolean
This is the core type designed for indexing Boolean values, which can be true or false. A sample definition of a field based on the Boolean type can be as follows:
"allowed" : { "type" : "boolean" }
Binary
The binary field holds a Base64 representation of the binary data stored in the index. You can use it to store data that is normally written in binary form, such as images. Fields based on this type are, by default, stored and not indexed. The binary type supports only the index_name property. A sample field definition based on the binary type looks like the following:
"image" : { "type" : "binary" }
Multi fields
Sometimes you would like to have the same field values in two fields; for example, one for searching and one for faceting. There is a special type in ElasticSearch, multi_field, that allows us to map several core types into a single field and have them analyzed differently. For example, if we would like to both facet and search on our name field, we could define the following multi_field:
"name": { "type": "multi_field", "fields": { "name": { "type" : "string", "index": "analyzed" }, "facet": { "type" : "string", "index": "not_analyzed" } } }
The preceding definition will create two fields: one that we refer to as name and another one that we use as name.facet. Of course, you don't have to specify two separate fields during indexing; a single field named name is enough and ElasticSearch will do the rest.
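With the preceding mapping in place, queries can target either variant of the field. For example, the following term query (a sketch; a term query against a not_analyzed field matches only the exact, unprocessed value, and the value shown is arbitrary) searches the unanalyzed variant:

curl -XGET 'http://localhost:9200/posts/_search?pretty=true' -d '{
  "query" : {
    "term" : { "name.facet" : "Sample document" }
  }
}'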
Using analyzers
As we mentioned when discussing the mappings for fields based on the string type, we can specify the analyzer to be used. But what is an analyzer? It is functionality used to analyze data or queries in the way we want them to be indexed or searched. For example, when we divide words on the basis of whitespace and lowercase characters, we don't have to worry about users sending words in lower- or uppercase. ElasticSearch allows us to use different analyzers at index time and at query time, so we can choose how we want our data to be processed at each stage of searching. To use one of the analyzers, we just need to specify its name in the correct property of the field, and that's all.
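A convenient way to see what a given analyzer does with your text is the _analyze REST endpoint; for example, the following command (the sample text is arbitrary) returns the tokens produced by the standard analyzer:

curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty=true' -d 'The Quick Brown FOX'

The response lists the produced tokens, together with their positions and offsets.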
Out-of-the-box analyzers
ElasticSearch allows us to use one of the many analyzers defined by default. The following analyzers are available out of the box:
standard: A standard analyzer that is convenient for most European languages (please refer to http://www.elasticsearch.org/guide/reference/index-modules/analysis/standard-analyzer.html for the full list of parameters).
simple: An analyzer that splits the provided value on non-letter characters and converts letters to lowercase.
whitespace: An analyzer that splits the provided value on the basis of whitespace characters.
stop: This is similar to the simple analyzer, but in addition to the simple analyzer functionality, it filters the data on the basis of the provided stop words set (please refer to http://www.elasticsearch.org/guide/reference/index-modules/analysis/stop-analyzer.html for the full list of parameters).
keyword: This is a very simple analyzer that just passes the provided value through. You would achieve the same effect by specifying the field as not_analyzed.
pattern: This is an analyzer that allows flexible text separation by the use of regular expressions (please refer to http://www.elasticsearch.org/guide/reference/index-modules/analysis/pattern-analyzer.html for the full list of parameters).
language: This is an analyzer that is designed to work with a specific language. The full list of languages supported by this analyzer can be found at http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer.html.
snowball: This is an analyzer similar to the standard one, but in addition, it provides a stemming algorithm (please refer to http://www.elasticsearch.org/guide/reference/index-modules/analysis/snowball-analyzer.html for the full list of parameters).
Defining your own analyzers
In addition to the analyzers mentioned previously, ElasticSearch allows us to define new ones. In order to do that, we need to add an additional section to our mappings file, the settings section, which holds the information required by ElasticSearch during index creation. This is how we define our custom settings section:
"settings" : { "index" : { "analysis": { "analyzer": { "en": { "tokenizer": "standard", "filter": [ "asciifolding", "lowercase", "ourEnglishFilter" ] } }, "filter": { "ourEnglishFilter": { "type": "kstem" } } } } }
As you can see, we specified that we want a new analyzer named en to be present. Each analyzer is built from a single tokenizer and multiple filters. A complete list of the default filters and tokenizers can be found at http://www.elasticsearch.org/guide/reference/index-modules/analysis/. Our en analyzer includes the standard tokenizer and three filters: asciifolding and lowercase, which are available by default, and ourEnglishFilter, which is a filter that we have defined ourselves.
To define a filter, we need to provide its name, its type (the type property), and any additional parameters required by that filter type. The full list of filter types available in ElasticSearch can be found at http://www.elasticsearch.org/guide/reference/index-modules/analysis/. That list changes constantly, so I'll skip commenting on it.
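For example, one filter type that does take additional parameters is stop; a sketch of a custom stop words filter (the name and the stop words list are arbitrary choices for illustration) could look as follows:

"filter": {
  "ourStopFilter": {
    "type": "stop",
    "stopwords": [ "a", "an", "the" ]
  }
}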
So, the mappings with the analyzer defined would be as follows:
{ "settings" : { "index" : { "analysis": { "analyzer": { "en": { "tokenizer": "standard", "filter": [ "asciifolding", "lowercase", "ourEnglishFilter" ] } }, "filter": { "ourEnglishFilter": { "type": "kstem" } } } } }, "mappings" : { "post" : { "properties" : { "id": { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name": { "type" : "string", "store" : "yes", "index" : "analyzed", "analyzer": "en" } } } } }
Analyzer fields
An analyzer field (_analyzer) allows us to specify a field whose value will be used as the analyzer name for the document to which the field belongs. Imagine that you have software running that detects the language a document is written in and that you store this information in the language field of the document. Additionally, you would like to use that information to choose the right analyzer. To do that, just add the following to your mappings file:
"_analyzer" : { "path" : "language" }
So the whole mappings file could be as follows:
{ "mappings" : { "post" : { "_analyzer" : { "path" : "language" }, "properties" : { "id": { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name": { "type" : "string", "store" : "yes", "index" : "analyzed" }, "language": { "type" : "string", "store" : "yes", "index" : "not_analyzed"} } } } }
However, please be advised that there has to be an analyzer defined with the same name as the value provided in the language field.
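With the preceding mappings in place, a document that should be processed by our previously defined en analyzer would simply carry that name in its language field; for example:

{
  "id" : 1,
  "name" : "Sample English document",
  "language" : "en"
}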
Default analyzers
There is one more thing we should say about analyzers: the ability to specify the one that should be used by default if no analyzer is defined. This is done in the same way as configuring a custom analyzer in the settings section of the mappings file, but instead of specifying a custom name for the analyzer, the default keyword should be used. So, to make our previously defined analyzer the default one, we can change the en analyzer to the following:
{ "settings" : { "index" : { "analysis": { "analyzer": { "default": { "tokenizer": "standard", "filter": [ "asciifolding", "lowercase", "ourEnglishFilter" ] } }, "filter": { "ourEnglishFilter": { "type": "kstem" } } } }
Storing a document source
Sometimes, you may not want to store separate fields; instead, you may want to store the whole input JSON document. In fact, ElasticSearch does that by default. If you want to change this behavior and do not want to include the source of the document, you need to disable the _source field. This is as easy as adding the following part to our type definition:
"_source" : { "enabled" : false }
So the whole mappings file would be as follows:
{ "mappings": { "post": { "_source": { "enabled": false }, "properties": { "id": {"type":"long", "store":"yes", "precision_step":"0" }, "name": {"type":"string", "store":"yes", "index":"analyzed" }, "published": {"type":"date", "store":"yes", "precision_step":"0" }, "contents": {"type":"string", "store":"no", "index":"analyzed" } } } } }
All field
Sometimes, it's handy to have some of the fields copied into one; instead of searching multiple fields, a general-purpose field will be used for searching, for example, when you don't know which fields to search on. By default, ElasticSearch will include the values from all the text fields in the _all field. On the other hand, you may want to disable this behavior. To do that, we should add the following part to our type definition:
"_all" : { "enabled" : false }
So the whole mappings file would look like the following:
{ "mappings": { "post": { "_all": { "enabled": false }, "properties": { "id": {"type":"long", "store":"yes", "precision_step":"0" }, "name": {"type":"string", "store":"yes", "index":"analyzed" }, "published": {"type":"date", "store":"yes", "precision_step":"0" }, "contents": {"type":"string", "store":"no", "index":"analyzed" } } } } }
However, please remember that the _all field will increase the size of the index, so it should be disabled if it is not needed.
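Also note that the _all field is what simple URI-based searches use by default; a query such as the following one, where no field is specified, is run against the _all field, so such queries will stop matching documents once _all is disabled:

curl -XGET 'http://localhost:9200/posts/_search?q=first'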