Configuring SolrCloud for high-indexing use cases
Solr is designed to work under high load, both in terms of querying and indexing. However, the default configuration provided with the example Solr deployment is not sufficient for such use cases. This recipe will show you how to prepare your SolrCloud collection configuration for use cases where the indexing rate is very high.
Getting ready
Before continuing reading the recipe, read the Running Solr on a standalone Jetty and Configuring SolrCloud for NRT use cases recipes in this chapter.
How to do it...
In very high indexing use cases, chances are that you'll use bulk indexing to index your data. In addition to this, because we are talking about SolrCloud, we'll use autocommit so that we can leave the data durability and visibility management to Solr. Let's discuss how to prepare the configuration for a use case where the indexing rate is high but the query rate is quite low; for example, when using Solr as a log centralization solution.
Let's assume that we are indexing more than 1,000 documents per second and that we have four nodes, each with 12 cores and 64 GB of RAM. Note that this hardware specification is not a requirement for indexing that number of documents; it is given here only for reference.
- First, we'll start with the autocommit configuration, which will look as follows (we add this to the solrconfig.xml file):

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <autoSoftCommit>
      <maxTime>600000</maxTime>
    </autoSoftCommit>
    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>
- The second step is to adjust the number of indexing threads. To do this, we add the following information to the indexConfig section of solrconfig.xml:

  <maxIndexingThreads>10</maxIndexingThreads>
- The third step is to adjust the memory buffer size for each indexing thread. To do this, we add the following information to the indexConfig section of solrconfig.xml (a combined view of all three changes is shown after this list):

  <ramBufferSizeMB>128</ramBufferSizeMB>
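For reference, here is a sketch of how the relevant fragments of the solrconfig.xml file might look once all three changes are in place (only the elements discussed in this recipe are shown; the rest of the file is omitted):

  <indexConfig>
    <maxIndexingThreads>10</maxIndexingThreads>
    <ramBufferSizeMB>128</ramBufferSizeMB>
  </indexConfig>

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <autoSoftCommit>
      <maxTime>600000</maxTime>
    </autoSoftCommit>
    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>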
Now, let's discuss what each of these changes means.
How it works...
We started with tuning the autocommit settings, which you should already be familiar with after reading the recipes mentioned in the Getting ready section. Since we are not worried about documents being visible as soon as they are indexed, we set the soft autocommit's maxTime property to 600000. This means that the searcher will be reopened every 10 minutes, so our documents will become visible at most 10 minutes after they are sent for indexing.
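Because visibility is handled by the soft autocommit, the indexing application should not send explicit commits with every batch. The following is a minimal, hypothetical SolrJ sketch of such bulk indexing; it assumes a SolrJ 5.x-style API, and the ZooKeeper address, collection name, and field names are made up for illustration:

  // A hypothetical bulk indexing client that relies on the server-side
  // autocommit settings -- note that no explicit commit is sent.
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkIndexer {
    public static void main(String[] args) throws Exception {
      // Assumed ZooKeeper address and collection name
      CloudSolrClient client = new CloudSolrClient("localhost:9983");
      client.setDefaultCollection("logs");

      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 1000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "log-" + i);
        doc.addField("message", "log line number " + i);
        batch.add(doc);
      }

      // Send the whole batch in a single request; durability and visibility
      // are left to the hard and soft autocommit settings configured earlier.
      client.add(batch);
      client.close();
    }
  }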
The one thing to note is the short time for the hard commit, which happens every 15 seconds (the maxTime property of the autoCommit section is set to 15000). We did this because we don't want the transaction log to contain a large number of entries, as this can cause problems during the recovery process.
We also increased the number of threads the index writer can use from the default of 8 to 10 by setting the maxIndexingThreads property. Since we have 12 cores on each machine and we are not querying much, we can allow more threads to use the index writer. If the number of threads currently using the index writer equals the maxIndexingThreads property, the next thread will wait for one of the running ones to finish. Remember that the maxIndexingThreads property only sets the maximum number of indexing threads allowed, which doesn't mean that many will be used every time.
We also increased the RAM buffer size from the default of 100 MB to 128 MB using the ramBufferSizeMB property. We did this to allow Lucene to buffer more documents in memory before flushing. If the size of the buffered documents exceeds the value of the ramBufferSizeMB property, Lucene will flush the data to the directory, which will decide what to do with it next. We have to remember, though, that we are also using autocommit, so the data will be flushed every 15 seconds anyway because of the hard autocommit settings.
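As a rough sanity check of how these two settings interact, assume an average document size of about 2 KB (this size is an assumption made only for this calculation, not something stated in the recipe):

  1,000 documents/s * 15 s * 2 KB ≈ 30 MB per hard commit interval

This is well below the 128 MB buffer, so in this scenario it is usually the 15-second hard commit, not the RAM buffer limit, that triggers the flush; the buffer limit only comes into play if the documents are larger or the indexing rate spikes.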
Note
Remember that we didn't take the size of the cluster into consideration; we simply used the number of nodes we had available. Keep in mind that if I/O is the bottleneck during indexing, spreading the collection across more nodes should help with the indexing load.
In addition to this, you might want to look at the merging policy and segment merge processes as this can become a major bottleneck. If you are interested, refer to the Tuning segment merging recipe in Chapter 9, Dealing with Problems.