Removing duplicate documents (deduplication)
Solr can prevent duplicate or near-duplicate documents from being indexed by using a signature (fingerprint) field. It provides this kind of deduplication natively through the Signature class, which can also be extended to implement new hash and signature algorithms.
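To make the idea of a signature field concrete, here is a minimal sketch in Python. It is not Solr's implementation: the `signature` and `deduplicate` helpers, the sample documents, and the choice of MD5 over the `title` and `artist` fields are all assumptions made for illustration. It only shows the general principle of hashing selected fields into a fingerprint and rejecting documents whose fingerprint has already been seen.

```python
import hashlib


def signature(doc, fields):
    """Return an MD5 hex digest computed over the selected fields of a document.

    Hypothetical helper: Solr computes its signatures inside the update
    chain; this only mimics the general idea.
    """
    digest = hashlib.md5()
    for name in fields:
        value = str(doc.get(name, ""))
        # Separate names and values with a NUL byte so that
        # ("ab", "c") and ("a", "bc") do not hash to the same value.
        digest.update(name.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(value.encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


def deduplicate(docs, fields):
    """Keep the first document for each distinct signature."""
    seen = set()
    unique = []
    for doc in docs:
        sig = signature(doc, fields)
        if sig in seen:
            continue  # same fingerprint as an earlier document: skip it
        seen.add(sig)
        # Store the fingerprint alongside the document, as a signature field would.
        unique.append({**doc, "signature": sig})
    return unique


# Made-up sample data; documents 1 and 2 differ only in their id.
docs = [
    {"id": "1", "title": "Blue Train", "artist": "John Coltrane"},
    {"id": "2", "title": "Blue Train", "artist": "John Coltrane"},
    {"id": "3", "title": "Kind of Blue", "artist": "Miles Davis"},
]

unique = deduplicate(docs, ["title", "artist"])
print([d["id"] for d in unique])  # prints ['1', '3']
```

An exact hash such as MD5 only catches documents whose selected fields are identical. Detecting near duplicates requires a fuzzier signature, which is why the signature implementation is pluggable.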
Let's see how we can implement deduplication in Solr. We'll use our musicCatalog core, which we also used in the previous chapter, and modify it:
Copy the musicCatalog core and create a new core called musicCatalog-dedupe from it. After we have created the new core, we'll change schema.xml to add a signature field that will contain the document signature/fingerprint:
<!-- Field to store the fingerprint/signature -->
<field name="signature" type="string" indexed="true" stored="true" required="true" multiValued="false" />
After adding the field, we'll add a new UpdateRequestProcessor element to the solrconfig.xml configuration file, which will...