Analysis
We mentioned earlier that Apache Lucene stores all of its data in an inverted index. Transforming the text of documents into terms in this index is required for Elasticsearch to respond to search requests successfully. The process of transforming the data in this way is called analysis.
Elasticsearch has an index analysis module, which maps to the Lucene Analyzer. In general, an analyzer is composed of a single Tokenizer and zero or more TokenFilters; the tokenizer may also be preceded by one or more CharFilters.
Note
Analysis modules and analyzers will be discussed in depth in Chapter 4, Analysis and Analyzers.
Elasticsearch provides many built-in character filters, tokenizers, and token filters. For example, a character filter may be used to strip out HTML markup and a token filter may be used to modify tokens (for example, to lowercase them). You can combine them to create custom analyzers, or you can use one of the built-in analyzers.
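As an illustration, the following sketch defines a custom analyzer in the index settings by combining a built-in character filter, tokenizer, and token filter (the index name blog and the analyzer name html_lowercase are hypothetical, chosen only for this example):

# Sketch: a custom analyzer combining built-in building blocks.
# The index and analyzer names here are hypothetical.
curl -XPUT localhost:9200/blog -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_lowercase": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

Here, html_strip removes HTML markup before tokenization, the standard tokenizer splits the text on word boundaries, and the lowercase token filter normalizes the resulting tokens.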
A good understanding of the analysis process is very important for improving the user's search experience and the relevance of search results, because Elasticsearch (actually, Lucene) uses analyzers both at indexing time and at query time.
Tip
It is crucial to remember that not all Elasticsearch queries are analyzed.
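For example, the term query looks in the inverted index for the exact term it is given, without analyzing the query text first, unlike the match query. The following is a minimal sketch of this difference, assuming an index like the company index built in the scenario below, where the firstname field is analyzed by the standard analyzer:

# Sketch: a term query is NOT analyzed, so "Joe" is looked up as-is.
# Assumes the company/employee example that follows (analyzed firstname field).
curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "term": {
      "firstname": "Joe"
    }
  }
}'

Because the standard analyzer lowercases terms at indexing time, the index contains joe, not Joe; this term query therefore returns no hits, while a match query for Joe is analyzed to joe and matches.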
Now let's examine the importance of the analyzer in terms of relevant search results with a simple scenario:
curl -XPOST localhost:9200/company/employee -d '{
  "firstname": "Joe Jeffers",
  "lastname": "Hoffman",
  "age": 30
}'

{"_index":"company","_type":"employee","_id":"AU7GIEQeR7spPlxvqlud","_version":1,"created":true}
We indexed an employee named Joe Jeffers Hoffman, 30 years old. Now let's search for the employees named Joe in the company index:
curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "joe"
    }
  }
}'

{
  "took": 68,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "company",
        "_type": "employee",
        "_id": "AU7GIEQeR7spPlxvqlud",
        "_score": 0.19178301,
        "_source": {
          "firstname": "Joe Jeffers",
          "lastname": "Hoffman",
          "age": 30
        }
      }
    ]
  }
}
All string type fields in the company index will be analyzed by the standard analyzer because the employee type was created with dynamic mapping.
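If you would like to inspect the mapping that dynamic mapping generated for us, you can retrieve it with the following command (output omitted here):

# A quick check of the dynamically generated mapping.
curl -XGET localhost:9200/company/_mapping?pretty

You should see firstname and lastname as string fields with no analyzer configured explicitly, which means the default analyzer applies.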
The standard analyzer is the default analyzer that Elasticsearch uses. It removes most punctuation and splits the text on word boundaries, as defined by the Unicode Consortium.
Note
If you want to have more information about the Unicode Consortium, please refer to http://www.unicode.org/reports/tr29/.
In this case, Joe Jeffers would be two tokens (joe and jeffers). To see how the standard analyzer works, run the following command:
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Joe Jeffers'

{
  "tokens" : [
    {
      "token" : "joe",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "jeffers",
      "start_offset" : 4,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
We searched for joe, and the document containing Joe Jeffers was returned to us because the standard analyzer had split the text on word boundaries and converted it to lowercase. The standard analyzer is built using the Lower Case Token Filter along with other components (the Standard Token Filter and the Stop Token Filter, for example).
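As a contrast, you can run the same text through the built-in keyword analyzer, which emits the whole input as a single token; this sketch previews the behavior we will configure with not_analyzed in the next example:

# Sketch: the keyword analyzer emits the input as one unmodified token.
curl -XGET 'localhost:9200/_analyze?analyzer=keyword&pretty' -d 'Joe Jeffers'

This time, the tokens array should contain a single token, Joe Jeffers, with its original case preserved.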
Now let's examine the following example:
curl -XDELETE localhost:9200/company
{"acknowledged":true}

curl -XPUT localhost:9200/company -d '{
  "mappings": {
    "employee": {
      "properties": {
        "firstname": {"type": "string", "index": "not_analyzed"}
      }
    }
  }
}'
{"acknowledged":true}

curl -XPOST localhost:9200/company/employee -d '{
  "firstname": "Joe Jeffers",
  "lastname": "Hoffman",
  "age": 30
}'
{"_index":"company","_type":"employee","_id":"AU7GOF2wR7spPlxvqmHY","_version":1,"created":true}
We deleted the company index created by dynamic mapping and recreated it with explicit mapping. This time, we used the not_analyzed value of the index option on the firstname field in the employee type. This means that the field will not be analyzed at indexing time:
curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "joe"
    }
  }
}'

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
As you can see, Elasticsearch did not return a result for the match query because the firstname field is configured with the not_analyzed value. Therefore, Elasticsearch did not use an analyzer during indexing; the indexed value was exactly as specified. In other words, Joe Jeffers was indexed as a single token. Unless otherwise indicated, the match query uses the default search analyzer, so if we want the document to be returned by the match query without changing the analyzer in this example, we need to specify the exact value (paying attention to uppercase/lowercase):
curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "Joe Jeffers"
    }
  }
}'
The preceding command will return the document we searched for. Now let's examine the following example:
curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "firstname": "Joe"
    }
  }
}'

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "company",
        "_type": "employee",
        "_id": "AU7GOF2wR7spPlxvqmHY",
        "_score": 0.30685282,
        "_source": {
          "firstname": "Joe Jeffers",
          "lastname": "Hoffman",
          "age": 30
        }
      }
    ]
  }
}
As you can see, our document was returned even though we did not specify the exact value (note that we still used an uppercase letter), because the match_phrase_prefix query analyzes the text and builds a phrase query out of the analyzed text, allowing a prefix match on the last term in the text.
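To observe this prefix behavior on our not_analyzed field more clearly, you can lengthen the prefix; since firstname was indexed as the single token Joe Jeffers, any case-sensitive prefix of that value should match (a sketch, using the same index as above):

# Sketch: with a not_analyzed field, the whole query text acts as one
# prefix term, so any case-sensitive prefix of "Joe Jeffers" matches.
curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "firstname": "Joe Jef"
    }
  }
}'

This query would also return our document, whereas a prefix that differs in case, such as joe jef, would not match.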