Solr Cookbook - Third Edition

Table of Contents

Preface
1. Apache Solr Configuration
2. Indexing Your Data
3. Analyzing Your Text Data
4. Querying Solr
5. Faceting
6. Improving Solr Performance
7. In the Cloud
8. Using Additional Functionalities
9. Dealing with Problems
Index

Configuring SolrCloud for high-querying use cases

One of the things that Solr is really great at is handling high-querying use cases. Whether it is distributed queries using SolrCloud or single-node queries running in a master-slave environment, Solr does very well when it comes to handling queries and scaling. In this recipe, we will concentrate on use cases where we index quite a small number of documents per second, but we want our queries to be handled with low latency.

Getting ready

Before reading this recipe, look at the Running Solr on a standalone Jetty, Configuring SolrCloud for NRT use cases, and Configuring SolrCloud for high-indexing use cases recipes in this chapter.

How to do it...

Giving general advice for high-querying use cases is pretty hard because it depends heavily on the data, the cluster structure, the query structure, and the target latency. In this recipe, we will look at three things: configuration, scaling, and general advice. Let's assume that we have four nodes, each with 128 GB of RAM and large disks, and that we have 100 million documents we want to search across.

We should start by sizing our cluster. In general, this means choosing the right number of nodes, the right number of shards and replicas for our collection, and the amount of memory we need. The general advice is to index some portion of your data and see how much disk space it uses. For example, assuming that 1,000 indexed documents take 1 MB of disk space, we can calculate the disk space needed by 100 million documents; this gives us about 100 GB of total disk space. With a replication factor of 2, we will need 200 GB, which means our four nodes should be enough to have the data cached by the operating system. In addition to this, we will need memory for Solr itself to operate (the spreadsheet at http://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls can help us estimate how much we will need).

Given these facts, we can end up with a minimum of four shards and a replication factor of 2, which will give us a leader and a replica for each of the four initial shards we create the collection with. However, going for more initial shards might be better for scaling at a later stage of your application's life cycle.
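
For the sake of illustration, assuming that Solr runs at localhost:8983 and that we call our collection products (both the address and the name are just placeholders for this sketch), such a collection could be created with the Collections API like this:

http://localhost:8983/solr/admin/collections?action=CREATE&name=products&numShards=4&replicationFactor=2&maxShardsPerNode=2

We set maxShardsPerNode to 2 because, with four shards and two copies of each, every one of our four nodes will host one leader and one replica.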

Knowing this, we can prepare the autocommit settings. To do this, we alter our solrconfig.xml configuration file and include the following update handler configuration:

<updateHandler class="solr.DirectUpdateHandler2">

 <updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
 </updateLog>

 <!-- Soft commit every 30 seconds: reopens the searcher so that newly indexed documents become visible -->
 <autoSoftCommit>
  <maxTime>30000</maxTime>
 </autoSoftCommit>

 <!-- Hard commit every 10 minutes: flushes data to disk and truncates the transaction log without reopening the searcher -->
 <autoCommit>
  <maxTime>600000</maxTime>
  <openSearcher>false</openSearcher>
 </autoCommit>
</updateHandler>

In addition to this, we should adjust caching, which is covered in the Configuring the document cache, Configuring the query result cache, and Configuring the filter cache recipes in Chapter 6, Improving Solr Performance.
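
While the details are covered in those recipes, the following is a minimal sketch of what the cache section of solrconfig.xml could look like (the cache classes match Solr's defaults, while the size and autowarmCount values are purely illustrative and need to be tuned to your data and queries):

<query>
 <filterCache class="solr.FastLRUCache" size="4096" initialSize="4096" autowarmCount="512"/>
 <queryResultCache class="solr.LRUCache" size="10000" initialSize="10000" autowarmCount="1024"/>
 <documentCache class="solr.LRUCache" size="16384" initialSize="16384" autowarmCount="0"/>
</query>

Autowarming the filter and query result caches helps keep query latency stable after each searcher reopening, which is exactly what we care about in a query-heavy deployment. The document cache cannot be autowarmed because the internal Lucene document identifiers change when the searcher is reopened, so its autowarmCount stays at 0.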

Finally, you might want to look at the merge policy and the segment merging process, as this can become a major bottleneck. If you are interested, refer to the Tuning segment merging recipe in Chapter 9, Dealing with Problems.
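
Just to give you an idea (the actual tuning is discussed in that recipe, and the values below are only an example, not a recommendation), the merge policy in Solr 4.x is configured in the indexConfig section of solrconfig.xml:

<indexConfig>
 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
 </mergePolicy>
</indexConfig>

Lowering segmentsPerTier results in fewer segments in the index and thus faster searches, at the price of more merging work during indexing, which is usually an acceptable trade-off when the indexing rate is low.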

How it works...

We started with sizing questions and estimations. Remember that the numbers you extrapolate from a small portion of the data are not exact; they are estimates. What's more, we now know that in order to have our index fully cached by the operating system, we will need at least 200 GB of RAM available for the system cache, because we will have at least one shard and its physical copy. Of course, four nodes with 128 GB of RAM each is more or less the perfect case in which we will be able to have our indices fully cached, because we will have a total of 512 GB of RAM across all the nodes. Given that we will end up with four leader shards, one on each machine, and four replicas, again one on each machine, and that our index will be evenly divided, this gives us 50 GB of data on each node (25 GB for the leader and the same for the replica because it is an exact copy).

A few words about having more shards: sometimes, if you expect your data to grow, it is good to create the collection with more shards initially and place multiple shards on a single node. This gives you more flexibility when you add new nodes; you can migrate some of the shards to the new nodes without the need to split them, or you can create a new collection with new shards and reindex your data.

Next, we adjusted the autocommit section. Since we don't need near real-time searching, we decided not to stress Solr too much and set the soft autocommit's maxTime to 30000 milliseconds, which means that newly indexed data will become visible within 30 seconds of indexing. In general, the more often you reopen the searcher, the more pressure is put on Solr and, thus, the slower the queries will be. So, if you query heavily, you should set the soft autocommit to the maximum time allowed by your use case.

Of course, we also included the hard autocommit and set it to be executed every 10 minutes (600000 milliseconds). We decided to go for this because we don't index much data, so the index shouldn't change too often and the transaction log shouldn't grow too large.
