There are two major functions that Solr supports: indexing and searching. Initially, data is uploaded to Apache Solr through various means; there are handlers for data in specific categories (XML, CSV, PDF, database, and so on). Once the data is uploaded, it goes through a cleanup stage called the update processor chain. In this chain, a de-duplication phase can remove duplicates in the data so that they do not appear in the index unnecessarily. Each update handler can have its own update processor chain, which can perform document-level operations prior to indexing, redirect indexing to a different server, or create multiple documents (or none) from a single one. The data is then transformed depending on its type.
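By way of illustration, here is a minimal SolrJ sketch of uploading a document through an update processor chain. The core URL and the chain name dedupe are assumptions for this example (the chain itself would be defined in solrconfig.xml), and the HttpSolrClient.Builder API shown is from newer SolrJ releases:

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class IndexingExample {
    public static void main(String[] args) throws Exception {
        // The URL and core name are assumptions; adjust to your installation.
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/collection1").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "An example document");

        // Route the update through a hypothetical "dedupe" update
        // processor chain declared in solrconfig.xml.
        UpdateRequest request = new UpdateRequest();
        request.setParam("update.chain", "dedupe");
        request.add(doc);
        request.process(client);

        client.commit();
        client.close();
    }
}
```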
Apache Solr can run in a master-slave mode. The index replicator is responsible for distributing indexes across multiple slaves. The master server maintains the index updates, and the slaves are responsible for talking to the master to get the updates replicated for high availability. The Apache Lucene core is packaged as a library with the Apache Solr application. It provides core functionality for Solr, such as indexing, query processing, searching data, ranking matched results, and returning them back.
Apache Lucene comes with a variety of query implementations. The query parser is responsible for parsing the queries passed by the end user as a search string. Lucene provides TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, RangeQuery, MultiTermQuery, FilteredQuery, SpanQuery, and so on as query implementations.
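To make this concrete, the following sketch builds a few of these query objects directly against the Lucene API. The field name title and the search terms are made up, and the constructors shown are from Lucene 5.x and later (older releases use slightly different ones):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryExamples {
    public static Query buildQuery() {
        // Matches documents whose "title" field contains the term "solr".
        Query term = new TermQuery(new Term("title", "solr"));

        // Matches the exact phrase "apache solr" in the "title" field.
        Query phrase = new PhraseQuery("title", "apache", "solr");

        // Matches any term starting with "luc" in the "title" field.
        Query prefix = new PrefixQuery(new Term("title", "luc"));

        // Combines the above: the term is required, the others optional.
        return new BooleanQuery.Builder()
                .add(term, BooleanClause.Occur.MUST)
                .add(phrase, BooleanClause.Occur.SHOULD)
                .add(prefix, BooleanClause.Occur.SHOULD)
                .build();
    }
}
```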
The index searcher is a basic component of Solr search, with a default base searcher class. This class is responsible for returning matched results for the searched keyword, ordered by the computed score. The index reader provides access to indexes stored in the filesystem and can be used to search an index. Similar to the index searcher, the index writer allows you to create and maintain indexes in Apache Lucene.
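The following sketch ties the three together: an index writer adds a document, an index reader opens the stored index, and an index searcher runs a query and prints the scored hits. The index path and field name are illustrative only:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/example-index"));

        // The index writer creates and maintains the index.
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(new StandardAnalyzer()));
        Document doc = new Document();
        doc.add(new TextField("title", "scaling search with solr", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // The index reader provides access to the stored index;
        // the index searcher runs queries and ranks results by score.
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(new TermQuery(new Term("title", "solr")), 10);
        for (ScoreDoc hit : hits.scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("title")
                    + " score=" + hit.score);
        }
        reader.close();
    }
}
```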
The analyzer is responsible for examining fields and generating tokens. The tokenizer breaks field data into lexical units, or tokens. The filter examines the stream of tokens coming from the tokenizer and either keeps and transforms them, or discards them and creates new ones. Together, tokenizers and filters form an analyzer chain, or pipeline; there can be only one tokenizer per analyzer, and the output of each stage is fed to the next. The analysis process is used by Solr for indexing as well as querying. Analyzers play an important role in speeding up query as well as index time, and they reduce the amount of data generated by these operations. You can define your own custom analyzers depending on your use case. In addition to the analyzer, Apache Solr allows administrators to make the search experience more effective by removing common words such as is, and, and are through the stopwords feature. Solr supports synonyms, thereby not limiting search to pure text matches. Through the process of stemming, words such as played, playing, and play can all be reduced to their base form. We are going to look at these features in the coming chapters and the appendix. Similar to stemming, the user can also search across multiple forms of a single word (for example, play, played, playing).
When a user fires a search query at Solr, it is passed to a request handler. By default, Apache Solr provides DisMaxRequestHandler; you can visit http://wiki.apache.org/solr/DisMaxRequestHandler to find more details about this handler. Based on the request, the request handler calls the query parser. You can see an example of the filter in the following figure:
The query parser is responsible for parsing queries and converting them into Lucene query objects. There are different types of parsers available (Lucene, DisMax, eDisMax, and so on); each parser offers different functionality and can be chosen based on your requirements. Once a query is parsed, it is handed over to the index searcher, which runs it against the index store and passes the results to the response writer.
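For instance, a client can select a parser per request through the defType parameter. The following hedged SolrJ sketch picks the eDisMax parser; the core URL and the field names in qf are assumptions:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class QueryParserExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/collection1").build();

        SolrQuery query = new SolrQuery("scaling big data");
        query.set("defType", "edismax");  // choose the eDisMax query parser
        query.set("qf", "title^2 text");  // query these fields, boosting title
        query.setRows(10);

        QueryResponse response = client.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("title"));
        }
        client.close();
    }
}
```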
The response writer is responsible for responding back to the client; it formats the query response based on the search outcomes from the Lucene engine. The following figure displays the complete process flow when a search is fired from the client:
Apache Solr ships with an example search interface that runs using Apache Velocity. Apache Velocity is a fast, open source template engine that quickly generates an HTML-based frontend. Users can customize these templates as per their requirements, although they are not often used in production.
Index handlers are a type of update handler, handling the tasks of adding, updating, and deleting documents in the index. Apache Solr supports updates through the index handler in XML, JSON, and CSV formats.
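A short SolrJ sketch of the delete side of these operations (the core URL, document ID, and query are illustrative):

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DeleteExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/collection1").build();

        // Index handlers also service deletes: by ID and by query.
        client.deleteById("doc-1");
        client.deleteByQuery("category:obsolete");
        client.commit();
        client.close();
    }
}
```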
Data Import Handler (DIH) provides a mechanism for integrating different data sources with Apache Solr for indexing. The data sources could be relational databases or web-based sources (for example, RSS, ATOM feeds, and e-mails).
Tip
Although DIH is a part of Solr development, the default installation does not include it in the Solr application; it needs to be included in the application explicitly.
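Once it is on the classpath and registered, an import can be triggered by sending a command to the handler's endpoint. Here is a hedged SolrJ sketch, assuming DIH is registered at /dataimport in solrconfig.xml:

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class DataImportExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/collection1").build();

        // Ask DIH for a full import; the /dataimport path is an
        // assumption that must match the handler name in solrconfig.xml.
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "full-import");
        QueryRequest request = new QueryRequest(params);
        request.setPath("/dataimport");
        client.request(request);
        client.close();
    }
}
```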
Apache Tika, a project in itself, extends the capabilities of Apache Solr to run on top of different types of files. When a document is handed to Tika, it automatically determines the type of file (that is, Word, Excel, or PDF) and extracts its content. Tika also extracts document metadata such as the author, title, and creation date, which, if provided for in the schema, go in as fields in Apache Solr. These can later be used as facets for the search interface.
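The following sketch shows Tika used standalone to extract both the body text and the metadata from a file; the filename report.pdf is hypothetical:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();

        // Tika detects the file type automatically and extracts
        // both the body text and the document metadata.
        try (InputStream stream = new FileInputStream("report.pdf")) {
            parser.parse(stream, handler, metadata);
        }

        System.out.println("Content: " + handler.toString());
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```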