Key collision methods
Key collision methods are based on the idea of creating a reduced but meaningful representation of a value (a key) and placing values that share the same key together in a bucket.
Optimus has implemented three methods that fall into this category: fingerprinting, n-gram fingerprinting, and phonetic fingerprinting.
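To make the bucketing idea concrete, here is a minimal sketch in plain Python. It does not use the Optimus API: `simple_key` and `cluster_by_key` are hypothetical names introduced only for illustration, and the key used here (trim and lowercase) is a deliberately simplified stand-in, not any of the three methods Optimus implements.

```python
from collections import defaultdict


def simple_key(value: str) -> str:
    # Hypothetical, simplified key: trim surrounding whitespace and lowercase.
    return value.strip().lower()


def cluster_by_key(values, key_func=simple_key):
    # Group every value under the key it produces.
    buckets = defaultdict(list)
    for value in values:
        buckets[key_func(value)].append(value)
    # Keep only buckets in which different spellings collided on the same key.
    return {key: group for key, group in buckets.items() if len(set(group)) > 1}


names = [" Optimus Prime", "optimus prime", "Optimus Prime", "Megatron"]
clusters = cluster_by_key(names)
print(clusters)
# {'optimus prime': [' Optimus Prime', 'optimus prime', 'Optimus Prime']}
```

Swapping in a different `key_func` changes how aggressively values are merged: the more a key discards, the more values collide, and the higher the risk of false positives.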
Fingerprinting
Fingerprinting is the method least likely to produce false positives, which is why Optimus uses it by default.
Optimus implements the same algorithm as OpenRefine, an open-source tool for working with messy data. The algorithm is described in the next code block.
The process that generates a key from a string value is outlined here and must be followed in this order:
- Remove leading and trailing whitespace (for example, from " Optimus Prime" to "Optimus Prime").
- Change all characters to their lowercase representation (for example, from "Optimus Prime" to "optimus prime").
- Remove all punctuation...