Working of a scorer on an inverted index
So far, we have seen what an inverted index is and how relevance calculation works. Let us now look at how a scorer works on an inverted index. Suppose we have an index with the following three documents:
To index the documents, we have applied WhitespaceTokenizer along with the EnglishMinimalStemFilterFactory class. The tokenizer breaks each sentence into tokens by splitting on whitespace, and EnglishMinimalStemFilterFactory converts plural English words to their singular forms. The index thus created would be similar to that shown as follows:
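As a rough illustration of this analysis chain, the following Python sketch builds a small inverted index. The three documents are hypothetical stand-ins invented for this example, not the documents from the original index. They are written in lowercase because neither whitespace tokenization nor plural stemming changes case. The minimal_stem function is a simplified approximation of plural stripping and is not the algorithm that EnglishMinimalStemFilterFactory actually uses.

```python
from collections import defaultdict

# Hypothetical documents, invented for illustration only.
DOCS = {
    1: "apples and bananas",
    2: "oranges and orange juice",
    3: "an orange with grapes",
}


def minimal_stem(token):
    """Rough plural stripping; a stand-in for the real stem filter."""
    if len(token) < 3 or not token.endswith("s"):
        return token
    if token.endswith("ss") or token.endswith("us"):
        return token
    return token[:-1]


def build_index(docs):
    """Map each stemmed term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # str.split() with no arguments splits on runs of whitespace.
        for token in text.split():
            term = minimal_stem(token)
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


index = build_index(DOCS)
print(sorted(index["orange"].items()))
```

With these assumed documents, the postings for orange contain documents 2 and 3, and document 2 has the higher term frequency because both oranges and orange reduce to the same term.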
A search for the term orange returns documents 2 and 3. Running the query with debugging enabled shows that the two documents receive different scores and that document 2 is ranked higher than document 3. The term frequency of orange in document 2 is higher than that in document 3.
However, this does not affect the score much as the number of terms in the document is small...
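To make the ranking concrete, here is a minimal sketch of a scorer that loosely follows the classic TF-IDF shape: the square root of the term frequency, a logarithmic inverse document frequency, and a length norm of one over the square root of the number of terms. The postings and document lengths are hypothetical values chosen for this example, and the formula is a simplification. The scores that Solr reports in its debug output include further factors and depend on the configured similarity.

```python
import math

# Hypothetical postings for the term "orange":
# doc_id -> (term frequency, number of terms in the document).
POSTINGS = {2: (2, 4), 3: (1, 4)}
NUM_DOCS = 3  # assumed total number of documents in the index


def score(freq, doc_len, doc_freq, num_docs):
    """Simplified TF-IDF score for a single-term query."""
    tf = math.sqrt(freq)                              # dampened term frequency
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))   # rarer terms weigh more
    length_norm = 1.0 / math.sqrt(doc_len)            # shorter documents weigh more
    return tf * idf * length_norm


doc_freq = len(POSTINGS)  # number of documents containing the term
scores = {
    doc_id: score(freq, doc_len, doc_freq, NUM_DOCS)
    for doc_id, (freq, doc_len) in POSTINGS.items()
}

# Rank documents by descending score.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for doc_id, value in ranked:
    print(doc_id, round(value, 4))
```

Under these assumptions both documents have the same length, so the length norm is identical and document 2 ranks first purely because of its higher term frequency. The square root keeps that advantage smaller than the raw frequency ratio.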