Changing similarity
In most cases, the default way of calculating the score of your documents is what you need. However, sometimes you need more from Solr than just the standard behavior. For example, you might want shorter documents to be more valuable than longer ones. Let's assume that you want to change the default behavior and use a different score calculation algorithm for the description field of your index. This recipe will show you how to leverage this functionality.
Getting ready
Before choosing one of the score calculation algorithms available in Solr, it's good to read a bit about them. The detailed description of all the algorithms is beyond the scope of this recipe and the book (although a simple description is mentioned later in the recipe), but I suggest visiting the Solr wiki page (or Javadocs) and reading basic information about the available implementations.
How to do it...
For the purpose of this recipe, let's assume we have the following index structure (just add the following entries to your schema.xml file):
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general_dfr" indexed="true" stored="true" />
The string and text_general types are available in the default schema.xml file provided with the example Solr distribution. However, we want DFRSimilarity to be used to calculate the score for the description field. In order to do this, we introduce a new type, which is defined as follows (just add the following entries to your schema.xml file):
<fieldType name="text_general_dfr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.DFRSimilarityFactory">
    <str name="basicModel">P</str>
    <str name="afterEffect">L</str>
    <str name="normalization">H2</str>
    <float name="c">7</float>
  </similarity>
</fieldType>
Also, to use the per-field similarity, we have to add the following entry to the schema.xml file:
<similarity class="solr.SchemaSimilarityFactory"/>
That's all. Now, let's have a look and see how this works.
How it works...
The index structure presented previously is pretty simple, as there are only three fields. The one thing we are interested in is that the description field uses our own custom field type called text_general_dfr.
The thing we are most interested in is the new field type definition called text_general_dfr. As you can see, apart from the index and query analyzers, there is an additional section called similarity. It is responsible for specifying which similarity implementation to use to calculate the score for a given field. You are probably used to defining field types, filters, and other things in Solr, so you probably know that the class attribute specifies the class that provides the desired implementation, in our case, solr.DFRSimilarityFactory. Also, if there is a need, you can specify additional parameters that configure the behavior of your chosen similarity class. In the previous example, we specified four additional parameters: basicModel, afterEffect, normalization, and c, all of which define the DFRSimilarity behavior.
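Which additional parameters make sense depends on the options you choose. As a rough sketch, assuming a hypothetical field type named text_general_dfr_h3 (it is not part of the recipe), the following variant switches the normalization to H3, which, according to the DFRSimilarityFactory Javadocs, is tuned with the mu parameter instead of c. The values are for illustration only and the analyzer is reduced to a single definition for brevity:

```xml
<!-- Hypothetical variant of the recipe's type; the name and all values are illustrative -->
<fieldType name="text_general_dfr_h3" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer is used for both indexing and querying to keep the sketch short -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.DFRSimilarityFactory">
    <!-- I(F): inverse term frequency basic model -->
    <str name="basicModel">I(F)</str>
    <!-- B: after effect based on the ratio of two Bernoulli processes -->
    <str name="afterEffect">B</str>
    <!-- H3: term frequency normalization based on a Dirichlet prior, configured with mu -->
    <str name="normalization">H3</str>
    <float name="mu">900</float>
  </similarity>
</fieldType>
```

Treat this only as a starting point for experiments; whether a given combination improves relevance depends on your data and queries.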
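The solr.SchemaSimilarityFactory class, declared as the global similarity, is what allows Solr to use the similarity defined on each field type. Fields whose type doesn't define its own similarity fall back to the default one.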
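To make the placement of the two similarity sections clearer, here is a minimal sketch of how they might sit in the schema.xml file. The schema name and version attributes are assumptions made for the example, and the analyzer is shortened; only the nesting matters here:

```xml
<!-- Sketch only: the schema name and version are assumed, not taken from the recipe -->
<schema name="example" version="1.5">
  <!-- Other field and fieldType definitions from the recipe would go here -->

  <fieldType name="text_general_dfr" class="solr.TextField" positionIncrementGap="100">
    <!-- Analyzer shortened for readability -->
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <!-- Per-type similarity: nested inside the fieldType element -->
    <similarity class="solr.DFRSimilarityFactory">
      <str name="basicModel">P</str>
      <str name="afterEffect">L</str>
      <str name="normalization">H2</str>
      <float name="c">7</float>
    </similarity>
  </fieldType>

  <!-- Global similarity: a direct child of the schema element, outside any fieldType -->
  <similarity class="solr.SchemaSimilarityFactory"/>
</schema>
```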
Although the recipe is not about all the similarities available, I wanted to list the available ones. Note that each similarity might require and use different configuration parameters (all of them are described in the provided Javadocs). The currently available similarity factories are:
solr.DefaultSimilarityFactory: This is the default Lucene similarity implementing the default scoring algorithm (the Javadoc is available at http://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/search/similarities/DefaultSimilarityFactory.html).
solr.SweetSpotSimilarityFactory: This is the extension to the default similarity factory, providing additional parameters to tune scoring behaviors (the Javadoc is available at http://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html).
solr.BM25SimilarityFactory: This is the similarity model that bases the score calculation on the probabilistic model, estimating the probability of finding a document for a given query. It is said that this similarity performs best on short texts (the Javadoc is available at http://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/search/similarities/BM25SimilarityFactory.html).
solr.DFRSimilarityFactory: This similarity is based on the divergence from randomness probabilistic model (the Javadoc is available at http://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/search/similarities/DFRSimilarityFactory.html).
solr.IBSimilarityFactory: This similarity is based on the information-based probabilistic model, which is similar to the divergence from randomness model (the Javadoc is available at http://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/search/similarities/IBSimilarityFactory.html).
solr.LMDirichletSimilarityFactory: This similarity is based on Bayesian smoothing using Dirichlet priors (the Javadoc is available at http://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/search/similarities/LMDirichletSimilarityFactory.html).
solr.LMJelinekMercerSimilarityFactory: This similarity is based on the Jelinek-Mercer smoothing method (the Javadoc is available at http://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/search/similarities/LMJelinekMercerSimilarityFactory.html).
Note that after the similarity model changes, full document reindexing should be performed.
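Any of the listed factories can be plugged into a field type in the same way as DFRSimilarityFactory. As an illustration, the following sketch assumes a hypothetical field type named text_general_lm (not part of the recipe) that uses the Dirichlet-smoothed language model. The mu value shown is, to my knowledge, the Lucene default and is included only to show where such a parameter goes:

```xml
<!-- Hypothetical field type; the name and the mu value are illustrative -->
<fieldType name="text_general_lm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.LMDirichletSimilarityFactory">
    <!-- mu: smoothing parameter of the Dirichlet prior -->
    <float name="mu">2000</float>
  </similarity>
</fieldType>
```

As with the recipe's type, this works only when solr.SchemaSimilarityFactory is declared as the global similarity.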
There's more...
In addition to per-field similarity definition, you can also configure the global similarity.
Changing the global similarity
Apart from specifying the similarity class on a per-field basis, you can also replace the default similarity globally. For example, if you want to use BM25Similarity as the default similarity, you should add the following entry to your schema.xml file:
<similarity class="solr.BM25SimilarityFactory"/>
As with the per-field similarity, you need to provide the name of the factory class that is responsible for creating the appropriate similarity class.
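The global similarity accepts configuration parameters in the same way as the per-field one. The following is a minimal sketch, assuming you want to set the two BM25 parameters explicitly. The values shown are the commonly used Lucene defaults and are meant for illustration, not as a recommendation:

```xml
<!-- Global BM25 similarity with explicit, illustrative parameter values -->
<similarity class="solr.BM25SimilarityFactory">
  <!-- k1: controls how quickly the term frequency contribution saturates -->
  <float name="k1">1.2</float>
  <!-- b: controls how strongly the document length normalizes the score (0 disables it) -->
  <float name="b">0.75</float>
</similarity>
```

Keep in mind that a schema has a single global similarity section, so an entry like this would take the place of the solr.SchemaSimilarityFactory entry used earlier in the recipe.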