Analysis
We mentioned earlier that Apache Lucene stores all of its data in an inverted index. Transforming the text of documents into terms in this index is required for Elasticsearch to respond to search requests successfully. The process of transforming the data in this way is called analysis.
Elasticsearch has an index analysis module, which maps to the Lucene Analyzer. In general, an analyzer is composed of a single Tokenizer and zero or more TokenFilters; the tokenizer may also be preceded by one or more CharFilters.
Note
Analysis modules and analyzers will be discussed in depth in Chapter 4, Analysis and Analyzers.
Elasticsearch provides many built-in character filters, tokenizers, and token filters. For example, a character filter may be used to strip out HTML markup and a token filter may be used to modify tokens (for example, to lowercase them). You can combine them to create custom analyzers, or you can use one of the built-in analyzers.
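As an illustration, the following sketch defines a custom analyzer in the index settings by combining a built-in character filter, tokenizer, and token filter (the index name blog and the analyzer name html_lowercase are hypothetical, chosen only for this example):

# Sketch: a custom analyzer combining built-in building blocks.
# The index and analyzer names here are hypothetical.
curl -XPUT localhost:9200/blog -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_lowercase": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

Here, html_strip removes HTML markup before tokenization, the standard tokenizer splits the text on word boundaries, and the lowercase token filter normalizes the resulting tokens.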
A good understanding of the analysis process is very important for improving the user's search experience and the relevance of search results, because Elasticsearch (actually, Lucene) uses analyzers both at indexing time and at query time.
Tip
It is crucial to remember that not all Elasticsearch queries are analyzed.
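For example, the term query looks in the inverted index for the exact term it is given, without analyzing the query text first, unlike the match query. The following is a minimal sketch of this difference, assuming an index like the company index built in the scenario below, where the firstname field is analyzed by the standard analyzer:

# Sketch: a term query is NOT analyzed, so "Joe" is looked up as-is.
# Assumes the company/employee example that follows (analyzed firstname field).
curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "term": {
      "firstname": "Joe"
    }
  }
}'

Because the standard analyzer lowercases terms at indexing time, the index contains joe, not Joe; this term query therefore returns no hits, while a match query for Joe is analyzed to joe and matches.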
Now let's examine the importance of the analyzer in terms of relevant search results with a simple scenario:
curl -XPOST localhost:9200/company/employee -d '{
  "firstname": "Joe Jeffers",
  "lastname": "Hoffman",
  "age": 30
}'

{"_index":"company","_type":"employee","_id":"AU7GIEQeR7spPlxvqlud","_version":1,"created":true}
We indexed an employee named Joe Jeffers Hoffman, 30 years old. Now let's search for the employees named Joe in the company index:
curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "joe"
    }
  }
}'

{
  "took": 68,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "company",
        "_type": "employee",
        "_id": "AU7GIEQeR7spPlxvqlud",
        "_score": 0.19178301,
        "_source": {
          "firstname": "Joe Jeffers",
          "lastname": "Hoffman",
          "age": 30
        }
      }
    ]
  }
}
All string type fields in the company index will be analyzed by the standard analyzer because the employee type was created with dynamic mapping.
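If you would like to inspect the mapping that dynamic mapping generated for us, you can retrieve it with the following command (output omitted here):

# A quick check of the dynamically generated mapping.
curl -XGET localhost:9200/company/_mapping?pretty

You should see firstname and lastname as string fields with no analyzer configured explicitly, which means the default analyzer applies.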
The standard analyzer is the default analyzer that Elasticsearch uses. It removes most punctuation and splits the text on word boundaries, as defined by the Unicode Consortium.
Note
If you want to have more information about the Unicode Consortium, please refer to http://www.unicode.org/reports/tr29/.
In this case, Joe Jeffers would be two tokens (joe and jeffers). To see how the standard analyzer works, run the following command:
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Joe Jeffers'

{
  "tokens" : [
    {
      "token" : "joe",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "jeffers",
      "start_offset" : 4,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
We searched for joe, and the document containing Joe Jeffers was returned to us because the standard analyzer had split the text on word boundaries and converted it to lowercase. The standard analyzer is built using the Lower Case Token Filter along with other components (the Standard Token Filter and the Stop Token Filter, for example).
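As a contrast, you can run the same text through the built-in keyword analyzer, which emits the whole input as a single token; this sketch previews the behavior we will configure with not_analyzed in the next example:

# Sketch: the keyword analyzer emits the input as one unmodified token.
curl -XGET 'localhost:9200/_analyze?analyzer=keyword&pretty' -d 'Joe Jeffers'

This time, the tokens array should contain a single token, Joe Jeffers, with its original case preserved.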
Now let's examine the following example:
curl -XDELETE localhost:9200/company
{"acknowledged":true}

curl -XPUT localhost:9200/company -d '{
  "mappings": {
    "employee": {
      "properties": {
        "firstname": {"type": "string", "index": "not_analyzed"}
      }
    }
  }
}'
{"acknowledged":true}

curl -XPOST localhost:9200/company/employee -d '{
  "firstname": "Joe Jeffers",
  "lastname": "Hoffman",
  "age": 30
}'
{"_index":"company","_type":"employee","_id":"AU7GOF2wR7spPlxvqmHY","_version":1,"created":true}
We deleted the company index created by dynamic mapping and recreated it with explicit mapping. This time, we used the not_analyzed value of the index option on the firstname field in the employee type. This means that the field will not be analyzed at indexing time:
curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "joe"
    }
  }
}'

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
As you can see, Elasticsearch did not return a result for the match query because the firstname field is configured with the not_analyzed value. Therefore, Elasticsearch did not use an analyzer during indexing; the indexed value was exactly as specified. In other words, Joe Jeffers was indexed as a single token. Unless otherwise indicated, the match query uses the default search analyzer, so if we want the document to be returned by the match query without changing the analyzer in this example, we need to specify the exact value (paying attention to uppercase/lowercase):
curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "Joe Jeffers"
    }
  }
}'
The preceding command will return the document we searched for. Now let's examine the following example:
curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "firstname": "Joe"
    }
  }
}'

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "company",
        "_type": "employee",
        "_id": "AU7GOF2wR7spPlxvqmHY",
        "_score": 0.30685282,
        "_source": {
          "firstname": "Joe Jeffers",
          "lastname": "Hoffman",
          "age": 30
        }
      }
    ]
  }
}
As you can see, our document was returned even though we did not specify the exact value (note that we still used an uppercase letter), because the match_phrase_prefix query analyzes the text and builds a phrase query out of the analyzed text, allowing a prefix match on the last term in the text.
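To observe this prefix behavior on our not_analyzed field more clearly, you can lengthen the prefix; since firstname was indexed as the single token Joe Jeffers, any case-sensitive prefix of that value should match (a sketch, using the same index as above):

# Sketch: with a not_analyzed field, the whole query text acts as one
# prefix term, so any case-sensitive prefix of "Joe Jeffers" matches.
curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "firstname": "Joe Jef"
    }
  }
}'

This query would also return our document, whereas a prefix that differs in case, such as joe jef, would not match.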