Challenges of large-scale indexing
Let us understand how indexing happens and what can be done to speed it up. We will also look at the challenges faced while indexing a large number of documents or bulky documents. An e-commerce site is a perfect example of an index with a large number of documents (products), while a job site is an example of an index with bulky documents (candidate resumes).
During indexing, Solr first analyzes the documents and converts them into tokens that are stored in a RAM buffer. When the RAM buffer is full, the data is flushed to a segment on disk. When the number of segments exceeds the value of the mergeFactor setting in the Solr configuration, the segments are merged. Data is also written to disk when a commit happens in Solr.
Let us discuss a few ways to speed up Solr indexing and to handle a large index containing a huge number of documents.
Using multiple threads for indexing on Solr
We can divide our data into smaller chunks, and each chunk can be indexed in a separate thread. Ideally, the number of threads should be twice the number of processor cores to avoid excessive context switching. However, we can increase the number of threads beyond that and check for performance improvements.
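As a minimal sketch of this approach, assuming SolrJ's HttpSolrServer class (which is thread safe); the chunk count and document contents here are synthetic placeholders:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        final SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        //Start with twice the number of processor cores
        int threads = 2 * Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        //Split the data into chunks; each thread indexes one chunk
        for (int chunk = 0; chunk < 10; chunk++) {
            final int chunkId = chunk;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
                        for (int i = 0; i < 1000; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", chunkId * 1000 + i);
                            doc.addField("desc", "description text for doc " + (chunkId * 1000 + i));
                            docs.add(doc);
                        }
                        server.add(docs); //each thread sends its own chunk
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit(); //a single commit after all threads have finished
    }
}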
Using the Java binary format of data for indexing
Instead of using XML files, we can use the Java binary (javabin) format for indexing. This avoids a lot of the overhead of parsing an XML file and converting it into a usable binary format. The way to use the javabin format is to write our own program for creating fields, adding fields to documents, and finally adding documents to the index. Here is a sample code:
import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

//Create an instance of the Solr server
String SOLR_URL = "http://localhost:8983/solr";
HttpSolrServer server = new HttpSolrServer(SOLR_URL);
//Send updates in the javabin format instead of the default XML
server.setRequestWriter(new BinaryRequestWriter());
//Create a collection of documents to add to the Solr server
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", 1);
doc1.addField("desc", "description text for doc 1");
SolrInputDocument doc2 = new SolrInputDocument();
doc2.addField("id", 2);
doc2.addField("desc", "description text for doc 2");
Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);
docs.add(doc2);
//Add the collection of documents to the Solr server and commit
server.add(docs);
server.commit();
Here is the reference to the API documentation for the HttpSolrServer class: http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrServer.html.
Note
Add all files from the <solr_directory>/dist folder to the classpath for compiling and running the HttpSolrServer program.
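For example, on Linux, the classpath can be supplied with a wildcard; the class name below is just a placeholder for whatever we call our indexing program:

javac -cp ".:<solr_directory>/dist/*" HttpSolrServerExample.java
java -cp ".:<solr_directory>/dist/*" HttpSolrServerExample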
Using the ConcurrentUpdateSolrServer class for indexing
Using the ConcurrentUpdateSolrServer class instead of the HttpSolrServer class can provide performance benefits, as the former uses buffers to store processed documents before sending them to the Solr server. We can also specify the number of background threads to use to empty the buffers. The API documentation for ConcurrentUpdateSolrServer can be found at the following link: http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.html
The constructor for the ConcurrentUpdateSolrServer class is defined as:
ConcurrentUpdateSolrServer(String solrServerUrl, int queueSize, int threadCount)
Here, queueSize is the size of the buffer in which documents are queued, and threadCount is the number of background threads used to flush the buffer to the Solr server.
Note that using too many threads can increase the context switching between threads and reduce performance. In order to optimize the number of threads, we should monitor performance (docs indexed per minute) after each increase and ensure that there is no decrease in performance.
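Here is a minimal sketch of this usage; the queue size, thread count, and document contents are illustrative values that should be tuned by measurement:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ConcurrentIndexer {
    public static void main(String[] args) throws Exception {
        //Buffer up to 1000 documents and drain them with 4 background threads
        SolrServer server = new ConcurrentUpdateSolrServer("http://localhost:8983/solr", 1000, 4);
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i);
            doc.addField("desc", "description text for doc " + i);
            server.add(doc); //returns quickly; the document is queued in the buffer
        }
        server.commit(); //blocks until the queue is drained, then commits
    }
}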
Solr configuration changes that can improve indexing performance
We can change the following directives in the solrconfig.xml file to improve the indexing performance of Solr (a sample configuration follows the list):
- ramBufferSizeMB: This property specifies the amount of data (in MB) that can be buffered in RAM before being flushed to disk. It can be increased to accommodate more documents in RAM, but increasing it beyond a certain point can cause swapping and reduce performance.
- maxBufferedDocs: This property specifies the number of documents that can be buffered in RAM before being flushed to disk. Make this a large number so that flushing always happens on the basis of the RAM buffer size instead of the number of documents.
- useCompoundFile: This property specifies whether to use a compound file or not. Using a compound file reduces indexing performance, as extra overhead is required to create it. On the other hand, disabling the compound file can create a large number of file descriptors during indexing.
Note
The default per-process limit on the number of file descriptors in Linux is 1024. The system-wide maximum number of file descriptors can be checked using the following command:
cat /proc/sys/fs/file-max
Check the hard and soft limits of file descriptors using the ulimit command:
ulimit -Hn
ulimit -Sn
To increase the number of file descriptors system-wide, edit the /etc/sysctl.conf file and add the following line:
fs.file-max = 100000
The system needs to be rebooted for the changes to take effect.
To temporarily change the number of file descriptors, run the following command as root:
sysctl -w fs.file-max=100000
- mergeFactor: Increasing the mergeFactor can cause a large number of segments to be merged in one go. This will speed up indexing but slow down searching. If the merge factor is too large, we may run out of file descriptors, and indexing may even slow down as there would be lots of disk I/O during merging. It is generally recommended to keep the merge factor constant or lower it to improve searching.
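As an illustration, these directives go in the indexConfig section of the solrconfig.xml file in Solr 4.x; the values below are arbitrary starting points to be tuned, not recommendations:

<indexConfig>
  <ramBufferSizeMB>256</ramBufferSizeMB>
  <maxBufferedDocs>100000</maxBufferedDocs>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
</indexConfig>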
Planning your commit strategy
Disable the autocommit feature during indexing so that commits can be done manually. Autocommit can be a pain, as it can cause commits that are too frequent. Committing manually instead reduces the overhead by decreasing the number of commits. Autocommit can be disabled in the solrconfig.xml file by setting the <autoCommit><maxTime> properties to a very large value.
Another strategy would be to configure the <autoCommit><maxTime> properties to a large value and use the autoSoftCommit property for frequent soft commits. Soft commits are faster, as the commit is not synced to disk; they are used to enable near real-time search. A sample configuration follows.
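For example, such a strategy could be configured in the solrconfig.xml file as follows; the intervals shown are illustrative, not recommendations:

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit rarely; openSearcher=false keeps it cheap -->
  <autoCommit>
    <maxTime>600000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit frequently to enable near real-time search -->
  <autoSoftCommit>
    <maxTime>5000</maxTime>
  </autoSoftCommit>
</updateHandler>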
We can also use the commitWithin tag instead of the autoSoftCommit tag. The former forces documents to be added to Solr via soft commits at certain intervals of time. The commitWithin tag can also be used with hard commits via the following configuration:
<commitWithin><softCommit>false</softCommit></commitWithin>
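On the client side, SolrJ can request the same behavior per update call. As a small illustration, assuming the server and docs objects from the earlier sample (the 10-second window is an arbitrary value):

//Ask Solr to commit these documents within 10 seconds (10,000 ms)
server.add(docs, 10000);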
Avoid using the autoSoftCommit / autoCommit / commitWithin tags while adding documents in bulk, as they have a major performance impact.
Using better hardware
Indexing involves a lot of disk I/O. Therefore, it can be improved by using a local file system instead of a remote one. Also, better hardware with higher I/O capability, such as a solid state drive (SSD), can improve writes and speed up the indexing process.
Distributed indexing
When dealing with large amounts of data to be indexed, in addition to speeding up the indexing process, we can work on distributed indexing. Distributed indexing can be done by creating multiple indexes on different machines and finally merging them into a single, large index. Even better would be to create the separate indexes on different Solr machines and use Solr sharding to query the indexes across multiple shards.
For example, an index of 10 million products can be broken into smaller chunks based on the product ID and can be indexed over 10 machines, with each indexing a million products. While searching, we can add these 10 Solr servers as shards and distribute our search queries over these machines.
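For instance, a distributed search can be issued by listing the participating servers in the shards parameter of the query; the hostnames below are hypothetical, with only three of the ten servers shown:

http://server1:8983/solr/select?q=laptop&shards=server1:8983/solr,server2:8983/solr,server3:8983/solr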