Analyzers constitute an important part of indexing. To understand what analyzers do, let's consider three documents:
- Document1 (tokens):
{ This , is , easy }
- Document2 (tokens):
{ This , is , fast }
- Document3 (tokens):
{ This , is , easy , and , fast }
Here, terms such as "This", "is", and "and" are not relevant keywords. It is unlikely that anyone will search for such words, as they don't contribute to the facts or context of the document. Hence, it's safe to skip these words while indexing, or rather, to avoid making them searchable.
So, the tokenization would be as follows:
- Document1 (tokens):
{ easy }
- Document2 (tokens):
{ fast }
- Document3 (tokens):
{ easy , fast }
Words such as "the", "or", and "and" are referred to as stop words. In most cases, these exist for grammatical support, and the chances that someone will search based on these words are slim. The analysis and removal of stop words is also highly language dependent. The process of selecting or transforming the searchable tokens from a document while indexing is called analyzing. The module that facilitates this is called an analyzer. The analyzer we just discussed is a stop word analyzer. By applying the right analyzer, you can minimize the number of searchable tokens and hence get better performance results.
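The stop word removal described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch implements it: the STOP_WORDS set is a hand-picked assumption (not Elasticsearch's default list), and lowercasing the tokens before the comparison is an extra step added here so that "This" matches "this".

```python
# Toy stop word analyzer: split on whitespace, lowercase, drop stop words.
# STOP_WORDS is an illustrative set, not Elasticsearch's default list.
STOP_WORDS = {"this", "is", "and", "the", "or"}


def analyze(text, stop_words=STOP_WORDS):
    """Return the searchable tokens of `text` with stop words removed."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stop_words]


documents = {
    "Document1": "This is easy",
    "Document2": "This is fast",
    "Document3": "This is easy and fast",
}

for name, text in documents.items():
    print(name, analyze(text))
```

Only "easy" and "fast" survive as searchable tokens, which matches the token sets listed for the three documents.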
There are three stages through which you can perform an analysis:
- Character filters: Filtering is done at the character level before the text is broken into tokens. A typical example is the HTML character filter. We might give Elasticsearch an HTML document to index; in such instances, we can use the HTML char filter to strip the markup.
- Tokenizers: The logic to break text down into tokens is defined in this stage. A typical example is the whitespace tokenizer, which splits the text into tokens wherever whitespace occurs.
- Token filters: On top of the previous stages, we apply token filters. In this stage, we filter the tokens to match our requirements. The length token filter is a typical example: a token filter of type length removes words that are too long or too short for the stream.
Here is a flowchart that depicts this process:
It should be noted that any number of character filters and token filters can be incorporated in their stages, while an analyzer uses exactly one tokenizer. A combination of these components is called an analyzer. To create an analyzer out of the existing components, all we need to do is add the configuration to our Elasticsearch configuration file.
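To make the three stages concrete, here is a minimal Python sketch of such a pipeline. It only mimics the order of the stages; the function names (html_strip, whitespace_tokenizer, length_filter, run_analyzer), the regular expression used to drop tags, and the length limits are all assumptions made for illustration and are not Elasticsearch code.

```python
import re


def html_strip(text):
    """Character filter: replace anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the text wherever whitespace occurs."""
    return text.split()


def length_filter(tokens, min_len=3, max_len=20):
    """Token filter: keep tokens whose length is within [min_len, max_len]."""
    return [token for token in tokens if min_len <= len(token) <= max_len]


def run_analyzer(text, char_filters, tokenizer, token_filters):
    """Apply char filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = run_analyzer(
    "<p>Elasticsearch is <b>fast</b></p>",
    char_filters=[html_strip],
    tokenizer=whitespace_tokenizer,
    token_filters=[length_filter],
)
print(tokens)
```

The character filter runs on the raw string, the tokenizer turns the string into a list, and every token filter maps a list of tokens to another list, which is why several filters can be chained around a single tokenizer.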
Types of character filters
The following are the different types of character filters:
- HTML stripper: This strips the HTML tags out of the text.
- Mapping char filter: Here, you can ask Elasticsearch to convert a set of characters or strings to another set of characters or strings. The options are as follows:
The following are different types of tokenizers:
- The whitespace tokenizer: A tokenizer of type whitespace divides text at whitespace.
- The shingle tokenizer: There are instances where you want to search for two consecutive words, such as Latin America. In conventional searches, Latin would be one token and America another, so you can't narrow the results down to text that has these words next to each other. With shingles, n consecutive tokens are grouped into a single token. (In Elasticsearch itself, shingles are produced by a token filter of type shingle that runs after tokenization.) Token generation for a 2Gram tokenizer would be as follows:
- The lowercase tokenizer: This splits text at non-letter characters and converts the resulting tokens to lowercase, thereby reducing the number of distinct terms in the index.
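The whitespace and shingle ideas above can be sketched as follows. This is a plain Python illustration, not Elasticsearch's implementation; the sample sentence and the function names are assumptions, and the sketch emits only the grouped tokens.

```python
def whitespace_tokenize(text):
    """Split text into tokens at whitespace."""
    return text.split()


def shingles(tokens, size=2):
    """Group every run of `size` consecutive tokens into a single token."""
    return [
        " ".join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
    ]


tokens = whitespace_tokenize("Visit Latin America today")
print(tokens)
print(shingles(tokens, size=2))
```

With a size of 2, "Latin America" becomes a token of its own, so a search for the two words next to each other can match it directly.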
The following are the different types of token filters:
- The stop word token filter: A set of words is recognized as stop words. This includes words like "is", "the", and "and" that don't add facts to the statement, but support it grammatically. A stop word token filter removes the stop words and hence helps to conduct more meaningful and efficient searches.
- The length token filter: With this, we can filter out tokens whose length falls outside a configured range, that is, tokens that are too short or too long.
- The stemmer token filter: Stemming is an interesting concept. There are words such as "learn", "learning", and "learnt" that are forms of the same word in different tenses. Here, we only need to index the base word "learn" for any of its forms. This is what a stemmer token filter does: it reduces the different forms of a word to its stem.
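As a rough illustration of the length and stemmer filters, consider the following Python sketch. The stemmer here is a deliberately naive toy, assuming a fixed suffix list and a hand-written table of irregular forms; real stemmers (such as the Snowball family) use far more elaborate rules, so treat this only as a picture of the idea.

```python
# Hand-written exceptions and suffixes; both are illustrative assumptions.
IRREGULAR = {"learnt": "learn"}
SUFFIXES = ("ing", "ed", "s")


def toy_stem(token):
    """Reduce a token to a crude stem by table lookup or suffix stripping."""
    token = token.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def length_filter(tokens, min_len=1, max_len=900):
    """Keep tokens whose length is within [min_len, max_len]."""
    return [token for token in tokens if min_len <= len(token) <= max_len]


words = ["learn", "learning", "learnt", "learned", "learns"]
print([toy_stem(word) for word in words])
print(length_filter(["a", "abc", "abcdef"], min_len=2, max_len=4))
```

All five forms collapse to the same stem, so a search for any of them can match documents containing the others.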
Creating your own analyzer
Now, let's create our own analyzer and apply it to an index. I want to make an analyzer that strips out HTML tags before indexing. Also, there should be no distinction between lowercase and uppercase while searching; in short, the search is case insensitive. We are not interested in searching for words such as "is" and "the", which are stop words. We are also not interested in words that have more than 900 characters. The following are the settings that you need to paste into the config/elasticsearch.yml file to create this analyzer:
Here, I named my analyzer myCustomAnalyzer. By adding the character filter html_strip, all HTML tags are removed from the stream. A filter called stopWord is created, where we define the stop words. If we don't mention the stop words, they are taken from the default set. The smallLetter tokenizer removes all the words that have more than 900 characters.
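One way to picture the analyzer described above is as a nested analysis structure. The sketch below writes it as a Python dictionary and prints it as JSON, which is the same shape the analysis section takes when it is supplied as index settings rather than in the configuration file. The names myCustomAnalyzer, stopWord, and smallLetter come from the text; defining smallLetter as a standard tokenizer with max_token_length set to 900 and using the built-in lowercase filter for case insensitivity are my assumptions. How over-long tokens are treated (dropped or split) may also differ between Elasticsearch versions.

```python
import json

# Hedged reconstruction of the analyzer described in the text.
analysis_settings = {
    "analysis": {
        "analyzer": {
            "myCustomAnalyzer": {
                "type": "custom",
                "char_filter": ["html_strip"],       # remove HTML tags
                "tokenizer": "smallLetter",          # defined below
                "filter": ["lowercase", "stopWord"], # case folding, stop words
            }
        },
        "tokenizer": {
            # Assumption: a standard tokenizer capped at 900 characters.
            "smallLetter": {"type": "standard", "max_token_length": 900}
        },
        "filter": {
            # Omitting "stopwords" falls back to the default stop word set.
            "stopWord": {"type": "stop", "stopwords": ["is", "the"]}
        },
    }
}

print(json.dumps(analysis_settings, indent=2))
```

Every name used inside the analyzer (the tokenizer and each filter) must either be a built-in component or be defined in the corresponding section of the same structure.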
A combination of character filters, token filters, and tokenizers is called an analyzer. You can make your own analyzer using these building blocks, but there are also ready-made analyzers that work well in most use cases. A Snowball Analyzer is an analyzer of the type snowball that uses the standard tokenizer with the standard filter, lowercase filter, stop filter, and snowball filter, which is a stemming filter.
Here is how you can pass the analyzer setting to Elasticsearch:
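A minimal sketch of this step, using only Python's standard library, might look as follows. It assumes a node listening on http://localhost:9200, a hypothetical index name wiki, and a hypothetical analyzer name wikiAnalyzer; the snowball analyzer type belongs to older Elasticsearch releases and may not be available in recent ones. The request is only built here; the send helper is what would actually contact the server.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local node
INDEX_NAME = "wiki"               # hypothetical index name


def build_create_index_request(index_name, settings, base_url=ES_URL):
    """Build (but do not send) a PUT request that creates an index."""
    body = json.dumps({"settings": settings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


settings = {
    "analysis": {
        "analyzer": {
            # Hypothetical name; "snowball" is the analyzer type from the text.
            "wikiAnalyzer": {"type": "snowball", "language": "English"}
        }
    }
}

request = build_create_index_request(INDEX_NAME, settings)
print(request.get_method(), request.full_url)
# send(request)  # uncomment when an Elasticsearch node is reachable
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before anything is sent.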
Having understood how we can create an index and define field mappings with analyzers, we shall go ahead and index some Wikipedia documents. For quick demonstration purposes, I have created a simple Python script to make some JSON documents. I am trying to create corresponding JSON files for the wiki pages of the following countries:
- China
- India
- Japan
- The United States
- France
Here is the script written in Python if you want to use it. This takes as input two command-line arguments: the first one is the title of the page and the second is the link:
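A minimal sketch of such a script is given below, using only the standard library. The field names title, link, and text, the crude tag stripping, and the output file name derived from the title are all assumptions made for illustration; a real script might extract the page content more carefully.

```python
import json
import os
import re
import sys
import urllib.request


def strip_tags(html):
    """Crudely remove HTML tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def fetch_text(link):
    """Download the page at `link` and return its text without tags."""
    with urllib.request.urlopen(link) as response:
        html = response.read().decode("utf-8", errors="replace")
    return strip_tags(html)


def build_document(title, link, text):
    """Assemble the JSON document; the field names are hypothetical."""
    return {"title": title, "link": link, "text": text}


def write_document(document, directory="."):
    """Write the document to <directory>/<title>.json and return the path."""
    path = os.path.join(directory, document["title"] + ".json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(document, handle, ensure_ascii=False, indent=2)
    return path


def main(argv):
    """Expect two arguments: the page title and the page link."""
    if len(argv) != 3:
        print("usage: json_generator.py <title> <link>")
        return 1
    title, link = argv[1], argv[2]
    write_document(build_document(title, link, fetch_text(link)))
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

The first argument becomes both the title field and the name of the generated file, so a title of France yields France.json.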
Let's assume the name of the Python file is json_generator.py. The following is how we execute it:
Now, we have a JSON file called France.json that has the sample data we are looking for.
I assume that you generated JSON files for each country that we mentioned. As seen earlier, indexing a document once it is created is simple. Using the script shown next, I created the index and defined the mappings:
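As a hedged sketch, the mapping part of such an index definition could be built like this. The field names title and text, the analyzer name, and the optional type name are assumptions; older Elasticsearch versions use the field type string and nest the mapping under a type name, while newer ones use text and no type name, so both variants are parameterized here.

```python
import json


def build_index_body(analyzer, text_type="text", doc_type=None):
    """Return an index body whose text fields use the given analyzer."""
    properties = {
        "title": {"type": text_type, "analyzer": analyzer},
        "text": {"type": text_type, "analyzer": analyzer},
    }
    mappings = {"properties": properties}
    if doc_type is not None:
        # Older versions nest the mapping under a type name.
        mappings = {doc_type: mappings}
    return {"mappings": mappings}


# Recent style: no type name, "text" fields.
print(json.dumps(build_index_body("myCustomAnalyzer"), indent=2))

# Older style: hypothetical type name "country", "string" fields.
print(json.dumps(
    build_index_body("myCustomAnalyzer", text_type="string", doc_type="country"),
    indent=2,
))
```

The analyzer named in the mapping has to exist, either as a built-in analyzer or as one defined in the settings of the same index.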
Once this is done, documents can be indexed like this. I assume that you have the file India.json. You can index it as:
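A minimal Python sketch of this step, again with the standard library only, is shown below. The base URL, the index name wiki, the type segment _doc, and the document ID are assumptions; older Elasticsearch versions expect a custom type name in place of _doc. The request is built from the file but is sent only when send is called.

```python
import json
import urllib.request


def build_index_request(path, index, doc_type, doc_id,
                        base_url="http://localhost:9200"):
    """Build (but do not send) a PUT request that indexes a JSON file."""
    with open(path, "rb") as handle:
        body = handle.read()
    json.loads(body)  # fail early if the file is not valid JSON
    return urllib.request.Request(
        url=f"{base_url}/{index}/{doc_type}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Example usage (requires India.json and a reachable node):
# send(build_index_request("India.json", "wiki", "_doc", "India"))
```

Calling the same helper once per generated file, with a different document ID each time, covers the remaining countries.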
Index all the documents likewise.