You're reading from Learning Elasticsearch Structured and unstructured data using distributed real-time search and analytics

Product type Paperback

Published in Jun 2017

Publisher Packt

ISBN-13 9781787128453

Length 404 pages

Edition 1st Edition

Tools

Elasticsearch

Concepts

Enterprise Search

Author (1):

Abhishek Andhavarapu

How does search work?

In the previous section, we discussed how to create, update, and delete documents. In this section, we will briefly discuss how search works internally and explain the basic query APIs. Mostly, I want to talk about the inverted index and Apache Lucene. All the data in Elasticsearch is internally stored in Apache Lucene as an inverted index. Although the data is stored in Apache Lucene, Elasticsearch is what makes it distributed and provides the easy-to-use APIs. We will discuss the Search API in detail in Chapter 6, All About Search.

Importance of information retrieval

As computation power increases and the cost of storage decreases, the amount of data we deal with day to day is growing exponentially. But without a way to retrieve and query that information, the data we collect doesn't help.

Information retrieval systems are very important to make sense of the data. Imagine how hard it would be to find some information on the Internet without Google or other search engines out there. Information is not knowledge without information retrieval systems.

Simple search query

Let's say we have a User table as shown here:

Id	Name	Age	Gender	Email
1	Luke	100	M	luke@gmail.com
2	Leia	100	F	leia@gmail.com

Now, we want to query for all the users with the name Luke. A SQL query to achieve this would be something like this:

select * from user where name like '%luke%'

To do a similar task in Elasticsearch, you can use the search API and execute the following command:

GET http://127.0.0.1:9200/chapter1/user/_search?q=name:luke

Let's inspect the request:

INDEX	chapter1
TYPE	user
FIELD	name

Just as the SQL query returns the matching rows of the User table, the Elasticsearch query returns the matching documents as JSON:

{
   "id": 1,
   "name": "Luke",
   "age": 100,
   "gender": "M",
   "email": "luke@gmail.com"
 }

Querying using the URL parameters can be used for simple queries as shown above. For more practical queries, you should pass the query represented as JSON in the request body. The same query passed in the request body is shown here:

POST http://127.0.0.1:9200/chapter1/user/_search 
{
   "query": {
     "term": {
       "name": "luke"
     }
   }
 }
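As a minimal sketch, the same request can be assembled from Python using only the standard library. The helper name build_search_request is hypothetical and not part of any Elasticsearch client; the URL layout with a type segment (chapter1/user) simply mirrors the request shown above and may differ in other Elasticsearch versions. The sketch only builds the request and does not send it.

```python
import json
from urllib.request import Request


def build_search_request(base_url, index, doc_type, field, value):
    """Build (but do not send) a term query against the _search endpoint.

    Hypothetical helper for illustration; the path layout mirrors the
    request shown in the text.
    """
    url = f"{base_url}/{index}/{doc_type}/_search"
    body = {"query": {"term": {field: value}}}
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request(
    "http://127.0.0.1:9200", "chapter1", "user", "name", "luke"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the resulting object to urllib.request.urlopen would send it to a node listening on that address, assuming one is running.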

The Search API is very flexible and supports different kinds of filters, sort, pagination, and aggregations.

Inverted index

Before we talk more about search, I want to talk about the inverted index. Knowing about the inverted index will help you understand the limitations and strengths of Elasticsearch compared with traditional database systems. The inverted index at its core is what sets Elasticsearch apart from other NoSQL stores, such as MongoDB and Cassandra.

We can compare an inverted index to an old library card catalog. When you need a book in a library, you use the card catalog, usually at the entrance of the library, to find it. An inverted index is similar to the card catalog. Imagine that you were to build a system like Google to search for the web pages mentioning your search keywords. Suppose we have three web pages with Yoda quotes from Star Wars, and you are searching for all the documents with the word fear.

Document1: Fear leads to anger

Document2: Anger leads to hate

Document3: Hate leads to suffering

In a library, without a card catalog to find the book you need, you would have to go to every shelf row by row, look at each book title, and see whether it's the book you need. Without an index, computer-based information retrieval systems would have to do the same.

Without the inverted index, the application has to go through each web page and check whether the word exists in the web page. An inverted index is similar to the following table. It is like a map with the term as the key and the list of documents the term appears in as the value.

Term	Document
Fear	1
Anger	1,2
Hate	2,3
Suffering	3
Leads	1,2,3
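The table can be reproduced with a few lines of Python. This is a toy sketch, not how Lucene stores its index: the tokenizer is a lowercase whitespace split, and the STOP_WORDS set (an assumption made here) drops the word to so that the result matches the terms listed in the table.

```python
from collections import defaultdict

DOCS = {
    1: "Fear leads to anger",
    2: "Anger leads to hate",
    3: "Hate leads to suffering",
}

# Assumption: "to" is treated as a stop word, since the table omits it.
STOP_WORDS = {"to"}


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in STOP_WORDS:
                index[token].add(doc_id)
    return index


index = build_inverted_index(DOCS)
for term in ("fear", "anger", "hate", "suffering", "leads"):
    print(term, sorted(index[term]))
```

Finding the documents that mention fear is then a single dictionary lookup, index["fear"], rather than a scan over every document.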

Once we construct an index, as shown in the preceding table, finding all the documents with the term fear is just a lookup. Just as a new book is added to the card catalog when the library receives it, we keep adding to the inverted index as we encounter new web pages. The preceding inverted index takes care of simple use cases, such as searching for a single term. But in reality, we query for much more complicated things, and we don’t use the exact words. Now let’s say we encountered a document containing the following:

Yosemite national park may be closed for the weekend due to forecast of substantial rainfall

We want to visit Yosemite National Park, and we are looking for the weather forecast in the park. But when we query for it in human language, we might query something like weather in yosemite or rain in yosemite. With the current approach, we will not be able to answer this query well: apart from yosemite, the query and the document share no terms. In particular, rain in the query does not match rainfall in the document, as shown:

Document	Query
rainfall	rain

To be able to answer queries like this and to improve the search quality, we employ various techniques, such as stemming and synonyms, which are discussed in the following sections.

Stemming

Stemming is the process of reducing a derived word to its root word. For example, rain, raining, rained, and rainfall have the common root word "rain". When a document is indexed, the root word is stored in the index instead of the actual word. Without stemming, we end up storing rain, raining, and rained as separate terms in the index, and a query for one form would miss documents containing the others. The query terms also go through the stemming process, and the root words are looked up in the index. Stemming increases the likelihood of users finding what they are looking for. When we query for rain in yosemite, even though the document originally had rainfall, the inverted index will contain the term rain.

We can configure stemming in Elasticsearch using Analyzers. We will discuss how to set up and configure analyzers in Chapter 3, Modeling Your Data and Document Relations.
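To make the idea concrete, here is a deliberately naive stemmer in Python. It is a sketch for illustration only and is not the algorithm used by Elasticsearch analyzers: it strips a handful of suffixes, and the EXCEPTIONS table (an assumption added here) maps rainfall to rain so that the example from the text works.

```python
# Toy suffix-stripping rules; real stemmers are far more careful.
SUFFIXES = ("ing", "ed", "s")

# Assumption: compound words need an explicit mapping in this sketch.
EXCEPTIONS = {"rainfall": "rain"}


def stem(word):
    """Reduce a word to a crude root form."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Apply the same stemming to documents and queries alike."""
    return [stem(token) for token in text.split()]


doc_terms = set(analyze("substantial rainfall"))
query_terms = set(analyze("rain in yosemite"))
print(doc_terms & query_terms)
```

Because the document and the query go through the same analyze function, rainfall and rain end up as the same term, and the query now has something to match.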

Synonyms

Just as rain and raining refer to the same concept, weekend and Sunday are closely related. The document might not contain Sunday, but if the information retrieval system can also search for synonyms, it will significantly improve the search quality. Human language has many variations, such as tense, gender, and number. Stemming and synonyms not only improve the search quality; by collapsing variants of a word into a single term, they can also reduce the index size.

More examples of reducing words to a common root:

Pen, Pens -> Pen

Eat, Eating -> Eat
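Synonym handling can be sketched in a similar way, here by expanding the query terms before the lookup. The SYNONYMS table and the expand_query helper are hypothetical and exist only for this illustration; they do not describe how Elasticsearch implements synonyms.

```python
# Assumption: a tiny hand-written synonym table.
SYNONYMS = {
    "sunday": {"weekend"},
    "saturday": {"weekend"},
}


def expand_query(query):
    """Return the query terms plus any configured synonyms."""
    expanded = set()
    for term in query.lower().split():
        expanded.add(term)
        expanded |= SYNONYMS.get(term, set())
    return expanded


document_terms = set(
    "yosemite national park may be closed for the weekend".split()
)
matched = expand_query("yosemite sunday") & document_terms
print(sorted(matched))
```

The query mentions sunday, which never appears in the document, yet the expanded query still matches on weekend.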

Phrase search

As users, we almost always search for phrases rather than single words. The inverted index in the previous section works well for individual terms but not for phrases. Continuing the previous example, if we want to query all the documents containing the phrase anger leads to, the previous index is not sufficient. The inverted index for the terms anger and leads is shown below:

Term	Document
Anger	1,2
Leads	1,2,3

From the preceding table, the words anger and leads exist in both document1 and document2, but the table does not tell us in which order they appear. To support phrase search, we need to record the position of the word in the document along with the document itself. The inverted index with word positions, written as document:position, is shown here:

Term	Document
Fear	1:1
Anger	1:3, 2:1
Hate	2:3, 3:1
Suffering	3:3
Leads	1:2, 2:2, 3:2

Now that we have the position of each word, we can check whether a document has the terms in the same order as the query.

Term	Document
anger	1:3, 2:1
leads	1:2, 2:2

Since document2 has anger as the first word and leads as the second word, the same order as the query, document2 is a better match than document1. With the inverted index, a query on the documents comes down to simple lookups. This is just an introduction to the inverted index; in real life, it's much more complicated, but the fundamentals remain the same. When documents are indexed into Elasticsearch, they are processed into the inverted index.
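The positional index and the phrase check can be sketched in Python as follows. This is a simplified illustration, not Lucene's implementation: the word to is dropped as a stop word before positions are assigned (an assumption made so the positions agree with the table above), and a phrase matches only when its terms occupy consecutive positions.

```python
from collections import defaultdict

DOCS = {
    1: "Fear leads to anger",
    2: "Anger leads to hate",
    3: "Hate leads to suffering",
}

# Assumption: "to" is removed before positions are numbered.
STOP_WORDS = {"to"}


def tokenize(text):
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text), start=1):
            index[token][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the terms at consecutive positions."""
    terms = tokenize(phrase)
    if not terms or any(term not in index for term in terms):
        return []
    # Only documents that contain every term can match the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    matches = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            if all(
                start + offset in index[term][doc_id]
                for offset, term in enumerate(terms)
            ):
                matches.append(doc_id)
                break
    return matches


index = build_positional_index(DOCS)
print(dict(index["anger"]))
print(phrase_search(index, "anger leads to"))
```

Both document1 and document2 contain anger and leads, but only document2 has them in adjacent positions in query order, so it is the only phrase match.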

Apache Lucene

Apache Lucene is one of the most mature implementations of the inverted index. Lucene is an open source, high-performance full-text search library written entirely in Java. Any application that requires full-text search can use Lucene. Elasticsearch uses Apache Lucene to create and manage its inverted index. To learn more about Apache Lucene, please visit http://lucene.apache.org/core/.

We will talk about how distributed search works in Elasticsearch in the next section.

The term index is used both for the Apache Lucene inverted index and for an Elasticsearch index. For the remainder of the book, unless otherwise specified, the term index refers to an Elasticsearch index.