ElasticSearch Blueprints

A practical project-based guide to generating compelling search solutions using the dynamic and powerful features of Elasticsearch

Product type Paperback
Published in Jul 2015
Publisher Packt
ISBN-13 9781783984923
Length 192 pages
Edition 1st Edition

Communicating with the Elasticsearch server

We will use cURL to communicate with Elasticsearch, which exposes a REST-like web API over HTTP. The HTTP methods it uses are as follows:

  • PUT: The HTTP method PUT is used to send configurations to Elasticsearch.
  • POST: The HTTP method POST is used to create new documents or to perform a search operation. When a document is indexed successfully using POST, Elasticsearch returns a unique ID that identifies the document.
  • GET: The HTTP method GET is used to retrieve an already indexed document. Each document has a unique ID called a doc ID (short for document ID). The ID returned when a document is indexed using POST can be used to retrieve the original document.
  • DELETE: The HTTP method DELETE is used to delete documents from the Elasticsearch index. Deletion can be performed based on a search query or directly using the document ID.
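
As a rough sketch of how the four methods map onto requests, the script below prints one cURL command per method rather than sending anything, so it can be read without a running server. The es_cmd helper, the DOC_ID placeholder, and the sample request bodies are illustrative assumptions, not part of Elasticsearch; news and public are the index and type names used later in this chapter.

```shell
#!/bin/sh
# Assumption: Elasticsearch listens on its default HTTP port, 9200, on this machine.
ES_URL="http://localhost:9200"

# Hypothetical helper: prints the cURL command for a request instead of running it.
# Usage: es_cmd METHOD PATH [JSON_BODY]
es_cmd() {
  if [ -n "$3" ]; then
    printf "curl -X %s '%s%s' -d '%s'\n" "$1" "$ES_URL" "$2" "$3"
  else
    printf "curl -X %s '%s%s'\n" "$1" "$ES_URL" "$2"
  fi
}

# PUT: send configuration (here, index settings).
es_cmd PUT /news '{"settings": {"index": {"number_of_replicas": 1}}}'
# POST: index a new document; the response carries the generated document ID.
es_cmd POST /news/public/ '{"Title": "Hello"}'
# GET: retrieve a document by its ID (DOC_ID is a placeholder).
es_cmd GET /news/public/DOC_ID
# DELETE: remove a document by its ID.
es_cmd DELETE /news/public/DOC_ID
```

Copying any printed line into a terminal sends the actual request, provided an Elasticsearch node is reachable at the assumed address.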

To specify the HTTP method in cURL, use the -X option, for example, curl -X POST http://localhost:9200/. JSON is the data format used to communicate with Elasticsearch. You can supply the JSON to cURL in either of the following ways:

  • A command line: You can use the -d option to specify the JSON to be sent in the command line itself, for example:
    curl -X POST 'http://localhost:9200/news/public/' -d '{ "time" : "12-10-2010"}'
    
  • A file: If the JSON is too long or inconvenient to type in a command line, you can save it in a file and ask cURL to read the JSON from that file. Use the same -d option with an @ symbol just before the filename, for example:
    curl -X POST 'http://localhost:9200/news/public/' -d @file
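
The two forms can be compared side by side with a small script. This is a sketch under a few assumptions: the file name news_doc.json is made up for the example, and Elasticsearch is assumed to be on localhost:9200. If nothing is listening there, the script only prints a notice.

```shell
#!/bin/sh
ES_URL="http://localhost:9200"   # assumption: default local Elasticsearch address
doc_file="news_doc.json"         # hypothetical file name

# Form 1: the JSON is passed inline with -d.
# The Content-Type header is required by Elasticsearch 6.0 and later
# and is harmless on older versions.
curl -s --connect-timeout 2 -X POST "$ES_URL/news/public/" \
  -H 'Content-Type: application/json' \
  -d '{ "time" : "12-10-2010" }' \
  || echo "Elasticsearch is not reachable at $ES_URL"

# Form 2: the same JSON is written to a file and passed with -d @file.
cat > "$doc_file" <<'EOF'
{ "time" : "12-10-2010" }
EOF
curl -s --connect-timeout 2 -X POST "$ES_URL/news/public/" \
  -H 'Content-Type: application/json' \
  -d @"$doc_file" \
  || echo "Elasticsearch is not reachable at $ES_URL"
```

Note that -d strips newlines and carriage returns when it reads from a file. That is fine for a single JSON document; cURL's --data-binary option sends the file unmodified when line breaks matter.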
    

Shards and replicas

Sharding is how Elasticsearch provides horizontal scaling. Scaling means increasing the capacity of the search engine, both the index size and the query rate (queries per second) it can handle. Let's say an application can store up to 1,000 feeds and still give reasonable performance. Now, we need to increase the capacity of this application to 2,000 feeds. This is where we look for scaling solutions. There are two types of scaling solutions:

  • Vertical scaling: Here, we add hardware resources, such as more main memory, more CPU cores, or RAID disks to increase the capacity of the application.
  • Horizontal scaling: Here, we add more machines to the system. In our example, we bring in one more machine and give each of the two machines 1,000 feeds. The result is computed by merging the results from both machines. As both machines work in parallel, a search should take roughly as long as it did on a single machine holding 1,000 feeds.
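
A toy model can make the horizontal-scaling idea concrete. The sketch below is not how Elasticsearch is implemented: plain files stand in for machines, cksum stands in for the routing hash, and the feed IDs and texts are invented. It only shows the two steps described above: each feed is stored on exactly one shard, and a search asks every shard and merges the partial results.

```shell
#!/bin/sh
# Toy model only: files play the role of shards on separate machines.
num_shards=2
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT   # remove the temporary files when the shell exits

# Pick a shard for a feed by hashing its ID (cksum is a stand-in hash).
shard_for() {
  hash=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( hash % num_shards ))
}

# "Index" a few feeds: each line is appended to exactly one shard file.
i=1
for text in "elections today" "football results" "elections recap" "weather update"; do
  echo "feed-$i $text" >> "$workdir/shard-$(shard_for "feed-$i")"
  i=$(( i + 1 ))
done

# "Search": query every shard, then merge (here: sort) the partial results.
search() {
  s=0
  while [ "$s" -lt "$num_shards" ]; do
    if [ -f "$workdir/shard-$s" ]; then
      grep -- "$1" "$workdir/shard-$s"
    fi
    s=$(( s + 1 ))
  done | sort
}

# Prints feed-1 and feed-3, whichever shards they landed on.
search elections
```

The caller of search never needs to know which shard holds which feed, which is the property that lets a machine be added without changing how queries are written.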

Guess what! Elasticsearch can be scaled both horizontally and vertically. You can add main memory to a machine to increase its performance, or you can simply add a new machine to increase capacity. Horizontal scaling is implemented using the concept of sharding in Elasticsearch.

Since Elasticsearch is a distributed system, we also need to address data safety and availability. We achieve this using replicas. When one replica (size 1) is defined for a cluster with more than one machine, two copies of the entire feed become available in the distributed system. This means that even if a single machine goes down, we won't lose data and, at the same time, the load is picked up by the remaining machines. One important point to mention here is that the default number of shards and replicas is generally sufficient, and we can also change the number of replicas later on.

This is how we create an index and pass the number of shards and replicas:

curl -X PUT "localhost:9200/news" -d '{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}'

A few things to be noted here are:

  • Adding more primary shards will increase the write throughput of the index.
  • Adding more replicas will increase the durability of the index and its read throughput, at the cost of disk space.
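
To put numbers on the disk-space cost: with 2 primary shards and 1 replica, the cluster holds 2 × (1 + 1) = 4 shard copies in total. The sketch below does that arithmetic and then raises the replica count on the existing news index through the index settings endpoint. The new value of 2 and the localhost:9200 address are assumptions for illustration; without a reachable server the script only prints a notice.

```shell
#!/bin/sh
ES_URL="http://localhost:9200"   # assumption: default local Elasticsearch address
primaries=2
replicas=1

# Every primary shard is stored once, plus once per replica.
total=$(( primaries * (1 + replicas) ))
echo "shard copies in the cluster: $total"   # prints: shard copies in the cluster: 4

# Change the replica count on the live index (2 is an example value).
# The Content-Type header is required by Elasticsearch 6.0 and later
# and is harmless on older versions.
new_settings='{ "index": { "number_of_replicas": 2 } }'
curl -s --connect-timeout 2 -X PUT "$ES_URL/news/_settings" \
  -H 'Content-Type: application/json' \
  -d "$new_settings" \
  || echo "Elasticsearch is not reachable at $ES_URL"
```

The number of primary shards, by contrast, is set when the index is created, so it is worth choosing with some thought up front.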

Index-type mapping

An index is a logical grouping in which feeds of the same kind are kept together. A type is a subgrouping under an index. To create a type under an index, you need to decide on a type name. In our case, we take the index name as news and the type name as public. We created the index in the previous step; now we need to define the data types of the fields that our data holds in the type mapping section.

Check out the sample given next. Here, the date data type takes the time format to be yyyy/MM/dd HH:mm:ss by default:

curl -X PUT "localhost:9200/news/public/_mapping" -d '{
  "public": {
    "properties": {
      "Title": { "type": "string" },
      "Content": { "type": "string" },
      "DOP": { "type": "date" }
    }
  }
}'
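
Reading the mapping back is a quick way to confirm that it was stored, and the same endpoint accepts additional fields later. In the sketch below, the Author field is a hypothetical addition, the string type and the URL layout follow the Elasticsearch versions this book targets, and the server address is assumed to be localhost:9200. Without a reachable server the script only prints a notice.

```shell
#!/bin/sh
ES_URL="http://localhost:9200"   # assumption: default local Elasticsearch address

# Read back the mapping of the public type in the news index.
curl -s --connect-timeout 2 -X GET "$ES_URL/news/public/_mapping" \
  || echo "Elasticsearch is not reachable at $ES_URL"

# Add a new field to the existing mapping (Author is a hypothetical field).
new_field='{
  "public": {
    "properties": {
      "Author": { "type": "string" }
    }
  }
}'
# The Content-Type header is required by Elasticsearch 6.0 and later
# and is harmless on older versions.
curl -s --connect-timeout 2 -X PUT "$ES_URL/news/public/_mapping" \
  -H 'Content-Type: application/json' \
  -d "$new_field" \
  || echo "Elasticsearch is not reachable at $ES_URL"
```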

Once a mapping has been applied, certain aspects of it, such as new field definitions, can still be updated. However, other aspects, such as the type of an existing field or its assigned analyzer, cannot be changed.

So, we now know how to create an index and add the necessary mappings to it. There is another important thing to take care of while indexing data: the analysis of that data. I guess you already know the importance of analysis. In simple terms, analysis is the breaking down of text into elementary units called tokens. This tokenization is a must and has to be given serious consideration. Elasticsearch has many built-in analyzers that do this job for you. At the same time, you are free to define your own custom analyzers if the built-in ones do not serve your purpose. Let's see analysis in detail and how we can define analyzers for fields.
