Before getting into the wonderful world of the Elasticsearch query language, we would like to introduce you to the simple but pretty flexible URI request search, which allows us to use a simple Elasticsearch query combined with the Lucene query language. Of course, we will extend our search knowledge using Elasticsearch in Chapter 3, Searching Your Data, but for now we will stick to the simplest approach.
All queries in Elasticsearch are sent to the _search endpoint. You can search a single index or multiple indices, and you can restrict your search to a given document type or multiple types. For example, in order to search our books index, we will run the following command:
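As a minimal sketch of the request just described, the snippet below only builds the URL that such a command would call. It assumes a node reachable at localhost:9200 and uses the books index name that appears in the responses in this section; the host, the port, and the search_url helper are illustrative assumptions, not something the text specifies.

```python
# Assumption: a local Elasticsearch node on the default HTTP port.
HOST = "localhost:9200"


def search_url(index):
    """Return the URI search URL for a single index.

    The `pretty` flag only asks Elasticsearch to indent the JSON response.
    """
    return f"http://{HOST}/{index}/_search?pretty"


print(search_url("books"))
```

The resulting URL can be passed to curl or to any HTTP client; no query is attached, so all documents match.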
The results returned by Elasticsearch will include all the documents from our books index (because no query has been specified) and should look similar to the following:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "books",
      "_type" : "es",
      "_id" : "2",
      "_score" : 1.0,
      "_source" : {
        "title" : "Elasticsearch Server Second Edition",
        "published" : 2014
      }
    }, {
      "_index" : "books",
      "_type" : "es",
      "_id" : "4",
      "_score" : 1.0,
      "_source" : {
        "title" : "Mastering Elasticsearch Second Edition",
        "published" : 2015
      }
    }, {
      "_index" : "books",
      "_type" : "solr",
      "_id" : "2",
      "_score" : 1.0,
      "_source" : {
        "title" : "Solr Cookbook Third Edition",
        "published" : 2015
      }
    }, {
      "_index" : "books",
      "_type" : "es",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "title" : "Elasticsearch Server",
        "published" : 2013
      }
    }, {
      "_index" : "books",
      "_type" : "solr",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "title" : "Apache Solr 4 Cookbook",
        "published" : 2012
      }
    }, {
      "_index" : "books",
      "_type" : "es",
      "_id" : "3",
      "_score" : 1.0,
      "_source" : {
        "title" : "Mastering Elasticsearch",
        "published" : 2013
      }
    } ]
  }
}
As you can see, the response has a header that tells you the total time of the query and the shards used in the query process. In addition to this, we have documents matching the query—the top 10 documents by default. Each document is described by the index, type, identifier, score, and the source of the document, which is the original document sent to Elasticsearch.
We can also run queries against many indices. For example, if we had another index called clients, we could also run a single query against these two indices as follows:
We can also run queries against all the data in Elasticsearch by omitting the index names completely or by setting the index name to _all:
In a similar manner, we can also choose the types we want to use during searching. For example, if we want to search only in the es type in the books index, we run a command as follows:
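The index and type combinations described in this section differ only in the path of the request. The following is a small sketch of how those paths are put together; the search_path helper is hypothetical and builds strings only, without contacting Elasticsearch.

```python
def search_path(indices=(), types=()):
    """Build the path part of a URI search request.

    An empty `indices` sequence means "search all the data".
    Types may only be given together with at least one index.
    """
    if types and not indices:
        raise ValueError("a type can only be given together with an index")
    parts = []
    if indices:
        parts.append(",".join(indices))
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


print(search_path(["books", "clients"]))  # two indices
print(search_path())                      # index names omitted
print(search_path(["_all"]))              # explicit _all
print(search_path(["books"], ["es"]))     # one type in one index
```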
Please remember that, in order to search for a given type, we need to specify the index or multiple indices. Elasticsearch allows us to have quite rich semantics when it comes to choosing index names. If you are interested, please refer to https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-index.html; however, there is one thing we would like to point out. When running a query against multiple indices, it may happen that some of them do not exist or are closed. In such cases, the ignore_unavailable property comes in handy. When set to true, it tells Elasticsearch to ignore unavailable or closed indices.
For example, let's try running the following query:
The response would be similar to the following one:
Now let's check what will happen if we add ignore_unavailable=true to our request and execute the following command:
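A sketch of how such a request URL could be assembled is shown below. The second index name, non_existing, is a made-up placeholder for an index that is missing or closed, and the host is assumed to be localhost:9200; neither comes from the original text.

```python
from urllib.parse import urlencode


def search_url(indices, **params):
    """Join index names with commas and append optional URI parameters."""
    base = "http://localhost:9200/" + ",".join(indices) + "/_search"
    query = urlencode(params)
    return f"{base}?{query}" if query else base


# "non_existing" is a hypothetical index used only for illustration.
without_flag = search_url(["books", "non_existing"])
with_flag = search_url(["books", "non_existing"], ignore_unavailable="true")

print(without_flag)
print(with_flag)
```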
In this case, Elasticsearch would return the results without any error.
Elasticsearch query response
Let's assume that we want to find all the documents in our books index that contain the elasticsearch term in the title field. We can do this by running the following query:
The response returned by Elasticsearch for the preceding request will be as follows:
The first section of the response gives us information about how much time the request took (the took property is specified in milliseconds), whether it timed out (the timed_out property), and information about the shards that were queried during the request execution: the number of queried shards (the total property of the _shards object), the number of shards that returned the results successfully (the successful property of the _shards object), and the number of failed shards (the failed property of the _shards object). The query may also time out if it is executed for a longer period than we want. (We can specify the maximum query execution time using the timeout parameter.) A failed shard means that something went wrong with that shard or it was not available during the search execution.
Of course, the mentioned information can be useful, but usually, we are interested in the results that are returned in the hits object. We have the total number of documents matching the query (in the total property) and the maximum score calculated (in the max_score property). Finally, we have the hits array that contains the returned documents. In our case, each returned document contains its index name (the _index property), the type (the _type property), the identifier (the _id property), the score (the _score property), and the _source field (usually, this is the JSON object sent for indexing).
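To make the layout of the response more concrete, here is a minimal sketch that reads these properties from a parsed response. The sample text is a shortened copy of the match-all response shown earlier in this section (only two of the six hits are kept), and the summarize helper is a hypothetical name.

```python
import json

# Shortened copy of the earlier match-all response; two of the six hits kept.
response_text = """
{
  "took": 3,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {
    "total": 6,
    "max_score": 1.0,
    "hits": [
      {"_index": "books", "_type": "es", "_id": "2", "_score": 1.0,
       "_source": {"title": "Elasticsearch Server Second Edition",
                   "published": 2014}},
      {"_index": "books", "_type": "solr", "_id": "1", "_score": 1.0,
       "_source": {"title": "Apache Solr 4 Cookbook", "published": 2012}}
    ]
  }
}
"""


def summarize(response):
    """Pull the header information and the document titles out of a response."""
    shards = response["_shards"]
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "all_shards_ok": shards["failed"] == 0
        and shards["successful"] == shards["total"],
        "total": hits["total"],
        "titles": [hit["_source"]["title"] for hit in hits["hits"]],
    }


summary = summarize(json.loads(response_text))
print(summary)
```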
You may wonder why the query we've run in the previous section worked. We indexed the Elasticsearch term and ran a query for elasticsearch and, even though they differ (capitalization), the relevant documents were found. The reason for this is the analysis. During indexing, the underlying Lucene library analyzes the documents and indexes the data according to the Elasticsearch configuration. By default, Elasticsearch will tell Lucene to index and analyze both string-based data and numbers. The same happens during querying because the URI request query maps to the query_string query (which will be discussed in Chapter 3, Searching Your Data), and this query is analyzed by Elasticsearch.
Let's use the indices-analyze API (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html). It allows us to see how the analysis process is done. With this, we can see what happened to one of the documents during indexing and what happened to our query phrase during querying.
In order to see what was indexed in the title field for the Elasticsearch Server phrase, we will run the following command:
The response will be as follows:
You can see that Elasticsearch has divided the text into two terms: the first one has a token value of elasticsearch and the second one has a token value of server.
Now let's look at how the query text was analyzed. We can do this by running the following command:
The response of the request will look as follows:
We can see that the word is the same as the original one that we passed to the query. We won't get into the Lucene query details and how the query parser constructed the query, but in general the indexed term after the analysis was the same as the one in the query after the analysis; so, the document matched the query and the result was returned.
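The matching logic described above can be illustrated with a toy model. The toy_analyze function below is a deliberately rough stand-in that splits on non-alphanumeric characters and lowercases the result; it is an assumption made for illustration and not the actual Elasticsearch or Lucene analyzer.

```python
import re


def toy_analyze(text):
    """Rough stand-in for analysis: tokenize on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


# What ends up in the index for the title of the document.
indexed_terms = toy_analyze("Elasticsearch Server")
# What the query text turns into.
query_terms = toy_analyze("elasticsearch")

# The document matches because an analyzed query term equals an indexed term.
document_matches = any(term in indexed_terms for term in query_terms)

print(indexed_terms, query_terms, document_matches)
```

Because both sides go through the same processing, the difference in capitalization disappears before the terms are compared.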
URI query string parameters
There are a few parameters that we can use to control URI query behavior, which we will discuss now. The thing to remember is that each parameter in the query should be concatenated with the & character, as shown in the following example:
Please remember to enclose the URL of the request using the ' characters because, on Linux-based systems, the & character will otherwise be interpreted by the shell.
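As a sketch of both points (joining parameters with & and quoting the URL for the shell), the snippet below assembles a curl command line as a string. The parameter values and the localhost:9200 address are illustrative assumptions; nothing is executed.

```python
import shlex
from urllib.parse import urlencode

# Illustrative parameter values; any of the parameters discussed in this
# section can be combined in the same way.
params = {"pretty": "true", "q": "published:2013", "df": "title"}

# urlencode joins the key=value pairs with '&'; ':' is kept readable.
url = "localhost:9200/books/_search?" + urlencode(params, safe=":")

# shlex.quote wraps the URL in single quotes so the shell does not treat
# '&' as "run in background" or '?' as a glob character.
command = "curl -XGET " + shlex.quote(url)

print(command)
```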
The q parameter allows us to specify the query that we want our documents to match. It allows us to specify the query using the Lucene query syntax described in the Lucene query syntax section later in this chapter. For example, a simple query would look like this: q=title:elasticsearch.
Using the df parameter, we can specify the default search field that should be used when no field indicator is used in the q parameter. By default, the _all field will be used. (This is the field that Elasticsearch uses to copy the content of all the other fields. We will discuss this in greater depth in Chapter 2, Indexing Your Data.) An example of the df parameter value can be df=title.
The analyzer property allows us to define the name of the analyzer that should be used to analyze our query. By default, our query will be analyzed by the same analyzer that was used to analyze the field contents during indexing.
The default operator property
The default_operator property, which can be set to OR or AND, allows us to specify the default Boolean operator used for our query (http://en.wikipedia.org/wiki/Boolean_algebra). By default, it is set to OR, which means that a single query term match will be enough for a document to be returned. Setting this parameter to AND for a query will result in returning the documents that match all the query terms.
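The difference between the two operators can be shown with a toy model that works on already analyzed terms. The matches function below is hypothetical and ignores scoring and everything else Lucene does; it only mirrors the "any term" versus "all terms" rule.

```python
def matches(doc_terms, query_terms, default_operator="OR"):
    """Toy model of default_operator: OR needs one term, AND needs all."""
    doc_terms = set(doc_terms)
    if default_operator == "OR":
        return any(term in doc_terms for term in query_terms)
    if default_operator == "AND":
        return all(term in doc_terms for term in query_terms)
    raise ValueError("default_operator must be OR or AND")


title_terms = ["elasticsearch", "server"]
query_terms = ["elasticsearch", "cookbook"]  # illustrative query terms

print(matches(title_terms, query_terms, "OR"))   # one matching term is enough
print(matches(title_terms, query_terms, "AND"))  # "cookbook" is missing
```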
If we set the explain parameter to true, Elasticsearch will include additional explain information with each document in the result, such as the shard from which the document was fetched and the detailed information about the scoring calculation (we will talk more about it in the Understanding the explain information section in Chapter 6, Make Your Search Better). Also remember not to fetch the explain information during normal search queries because it requires additional resources and degrades query performance. For example, a query that includes explain information could look as follows:
The results returned by Elasticsearch for the preceding query would be as follows:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.70273256,
    "hits" : [ {
      "_shard" : 2,
      "_node" : "v5iRsht9SOWVzu-GY-YHlA",
      "_index" : "books",
      "_type" : "solr",
      "_id" : "2",
      "_score" : 0.70273256,
      "_source" : {
        "title" : "Solr Cookbook Third Edition",
        "published" : 2015
      },
      "_explanation" : {
        "value" : 0.70273256,
        "description" : "weight(title:solr in 0) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 0.70273256,
          "description" : "fieldWeight in 0, product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(freq=1.0), with freq of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "termFreq=1.0",
              "details" : [ ]
            } ]
          }, {
            "value" : 1.4054651,
            "description" : "idf(docFreq=1, maxDocs=3)",
            "details" : [ ]
          }, {
            "value" : 0.5,
            "description" : "fieldNorm(doc=0)",
            "details" : [ ]
          } ]
        } ]
      }
    }, {
      "_shard" : 3,
      "_node" : "v5iRsht9SOWVzu-GY-YHlA",
      "_index" : "books",
      "_type" : "solr",
      "_id" : "1",
      "_score" : 0.5,
      "_source" : {
        "title" : "Apache Solr 4 Cookbook",
        "published" : 2012
      },
      "_explanation" : {
        "value" : 0.5,
        "description" : "weight(title:solr in 1) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 0.5,
          "description" : "fieldWeight in 1, product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(freq=1.0), with freq of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "termFreq=1.0",
              "details" : [ ]
            } ]
          }, {
            "value" : 1.0,
            "description" : "idf(docFreq=1, maxDocs=2)",
            "details" : [ ]
          }, {
            "value" : 0.5,
            "description" : "fieldNorm(doc=1)",
            "details" : [ ]
          } ]
        } ]
      }
    } ]
  }
}
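The numbers in the explain output can be checked by hand. The sketch below assumes the classic Lucene TF-IDF similarity, where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)); these formulas are not stated in the text, but they reproduce the values shown above. The fieldNorm value is taken from the output as-is.

```python
import math


def classic_score(freq, doc_freq, max_docs, field_norm):
    """fieldWeight = tf * idf * fieldNorm (classic TF-IDF, assumed here)."""
    tf = math.sqrt(freq)
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))
    return tf * idf * field_norm


# First hit: tf(freq=1.0), idf(docFreq=1, maxDocs=3), fieldNorm 0.5
first = classic_score(1.0, 1, 3, 0.5)
# Second hit: tf(freq=1.0), idf(docFreq=1, maxDocs=2), fieldNorm 0.5
second = classic_score(1.0, 1, 2, 0.5)

print(round(first, 6), round(second, 6))
```

Note that docFreq and maxDocs differ between the two hits because each shard computes these statistics from its own data.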
By default, for each document returned, Elasticsearch will include the index name, the type name, the document identifier, score, and the _source field. We can modify this behavior by adding the fields parameter and specifying a comma-separated list of field names. The fields will be retrieved from the stored fields (if they exist; we will discuss them in Chapter 2, Indexing Your Data) or from the internal _source field. By default, the value of the fields parameter is _source. An example is: fields=title,priority.
We can also disable the fetching of the _source field by adding the _source parameter with its value set to false.
Using the sort parameter, we can specify custom sorting. The default behavior of Elasticsearch is to sort the returned documents in descending order of the value of the _score field. If we want to sort our documents differently, we need to specify the sort parameter. For example, adding sort=published:desc will sort the documents in descending order of the published field. By adding the sort=published:asc parameter, we will tell Elasticsearch to sort the documents on the basis of the published field in ascending order.
If we specify custom sorting, Elasticsearch will omit the _score field calculation for the documents. This may not be the desired behavior in your case. If you still want to keep track of the scores for each document when using a custom sort, you should add the track_scores=true property to your query. Please note that tracking the scores when doing custom sorting will make the query a little bit slower (you may not even notice the difference) due to the processing power needed to calculate the score.
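A small sketch of how the sort and track_scores parameters might be assembled is shown below. The sort_params helper is a hypothetical convenience that only produces the query string fragment.

```python
from urllib.parse import urlencode


def sort_params(field, order="desc", track_scores=False):
    """Build the sort (and optionally track_scores) URI parameters."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    params = {"sort": f"{field}:{order}"}
    if track_scores:
        params["track_scores"] = "true"
    return params


# Sort by the published field, newest first, and still compute scores.
query_string = urlencode(sort_params("published", "desc", track_scores=True),
                         safe=":")
print(query_string)
```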
By default, Elasticsearch doesn't have a timeout for queries, but you may want your queries to time out after a certain amount of time (for example, 5 seconds). Elasticsearch allows you to do this by exposing the timeout parameter. When the timeout parameter is specified, the query will be executed up to the given timeout value and the results that were gathered up to that point will be returned. To specify a timeout of 5 seconds, you will have to add the timeout=5s parameter to your query.
Elasticsearch allows you to specify the results window (the range of documents in the results list that should be returned). We have two parameters that allow us to specify the results window size: size and from. The size parameter defaults to 10 and defines the maximum number of results returned. The from parameter defaults to 0 and specifies from which document the results should be returned. In order to return five documents starting from the 11th one, we will add the following parameters to the query: size=5&from=10.
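Since from is zero-based, it is easy to be off by one. The following sketch converts a one-based starting position into the two parameters; results_window is a hypothetical helper name.

```python
from urllib.parse import urlencode


def results_window(first_position, count):
    """Translate "count documents starting at the Nth one" into size/from.

    `first_position` is one-based (1 means the first result), while the
    `from` parameter is zero-based.
    """
    if first_position < 1 or count < 0:
        raise ValueError("first_position must be >= 1 and count >= 0")
    return {"size": count, "from": first_position - 1}


# Five documents starting from the 11th one.
window = results_window(11, 5)
print(urlencode(window))
```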
Limiting per-shard results
Elasticsearch allows us to specify the maximum number of documents that should be fetched from each shard using the terminate_after property and specifying the maximum number of documents. For example, if we want to get no more than 100 documents from each shard, we can add terminate_after=100 to our URI request.
Ignoring unavailable indices
When running queries against multiple indices, it is handy to tell Elasticsearch that we don't care about the indices that are not available. By default, Elasticsearch will throw an error if one of the indices is not available, but we can change this by simply adding the ignore_unavailable=true parameter to our URI request.
The URI query allows us to specify the search type using the search_type parameter, which defaults to query_then_fetch. Two values that we can use here are: dfs_query_then_fetch and query_then_fetch. The rest of the search types available in older Elasticsearch versions are now deprecated or removed. We'll learn more about search types in the Understanding the querying process section of Chapter 3, Searching Your Data.
Lowercasing term expansion
Some queries, such as the prefix query, use query expansion. We will discuss this in the Query rewrite section in Chapter 4, Extending Your Querying Knowledge. We are allowed to define whether the expanded terms should be lowercased or not using the lowercase_expanded_terms property. By default, the lowercase_expanded_terms property is set to true, which means that the expanded terms will be lowercased.
Wildcard and prefix analysis
By default, wildcard queries and prefix queries are not analyzed. If we want to change this behavior, we can set the analyze_wildcard property to true.
We thought that it would be good to know a bit more about what syntax can be used in the q parameter passed in the URI query. Some of the queries in Elasticsearch (such as the one currently being discussed) support the Lucene query parser syntax, the language that allows you to construct queries. Let's take a look at it and discuss some basic features.
A query that we pass to Lucene is divided into terms and operators by the query parser. Let's start with the terms; they come in two types: single terms and phrases. For example, to query for the book term in the title field, we will pass the following query:
To query for the elasticsearch book phrase in the title field, we will pass the following query:
You may have noticed that the name of the field comes first, followed by the term or the phrase.
As we already said, the Lucene query syntax supports operators. For example, the + operator tells Lucene that the given part must be matched in the document, meaning that the term we are searching for must be present in the field in the document. The - operator is the opposite, which means that such a part of the query can't be present in the document. A part of the query without the + or - operator will be treated as a part that can be matched but is not mandatory. So, if we want to find a document with the book term in the title field and without the cat term in the description field, we send the following query:
We can also group multiple terms with parentheses, as shown in the following query:
We can also boost parts of the query with the ^ operator and the boost value after it. Boosting increases the importance of a query part for the scoring algorithm: the higher the boost, the more important the query part is. This is shown in the following query:
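The syntax elements discussed in this section can be summarized with a few string-building helpers. This is only a sketch: the helper names are hypothetical, and the grouped terms (crime, punishment) and the boost value of 4 are illustrative values chosen here, not taken from the text.

```python
from urllib.parse import quote


def term(field, value):
    return f"{field}:{value}"


def phrase(field, words):
    return f'{field}:"{words}"'


def must(clause):
    return "+" + clause


def must_not(clause):
    return "-" + clause


def group(field, *terms):
    return f"{field}:(" + " ".join(terms) + ")"


def boost(clause, factor):
    return f"{clause}^{factor}"


single = term("title", "book")
quoted = phrase("title", "elasticsearch book")
required = must(term("title", "book")) + " " + must_not(term("description", "cat"))
grouped = group("title", "crime", "punishment")   # illustrative terms
boosted = boost(term("title", "book"), 4)         # illustrative boost value

# In a URL, '+' would be read as a space, so the q value is percent-encoded.
encoded_q = quote(required, safe="")

for query in (single, quoted, required, grouped, boosted, encoded_q):
    print(query)
```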
These are the basics of the Lucene query language and should allow you to use Elasticsearch and construct queries without any problems. However, if you are interested in the Lucene query syntax and you would like to explore that in depth, please refer to the official documentation of the query parser available at http://lucene.apache.org/core/5_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html.