Exploring string clustering
String clustering may be one of the most underrated data-cleansing functions. To explain what it is for, we can borrow the definition from OpenRefine, which describes its goal as to "find groups of different values that could be alternative representations of the same thing."
For example, you may have a column with values like this:
A (object)
-------------
Optimus
Optimus Prime
Prime
All these values can be represented as the same string since they reference the same thing. For example, "Optimus" or "Optimus Prime" are both valid options, depending on the need. Optimus will give you the tools to apply different string-clustering methods, suggest a value that best represents what you want, and then replace the values to achieve a cohesive representation of the data.
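To make the cluster, suggest, and replace workflow concrete, here is a minimal sketch of one simple clustering approach, the key-collision "fingerprint" technique described by OpenRefine, written in plain Python. It does not use the Optimus API; the function names (`fingerprint`, `cluster`, `suggest`) and the sample values are assumptions made up for illustration, and picking the most frequent member as the representative is just one reasonable choice.

```python
import string
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Build a fingerprint key for a string (hypothetical helper)."""
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Trim surrounding whitespace and lowercase.
    text = text.strip().lower()
    # Drop ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, remove duplicate tokens, and sort them.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share the same fingerprint key."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


def suggest(members):
    """Suggest the most frequent member as the cluster's representative."""
    return Counter(members).most_common(1)[0][0]


# Hypothetical sample data, not taken from the text.
values = ["Optimus Prime", "optimus prime", "Prime, Optimus", "Optimus Prime", "Megatron"]

clusters = cluster(values)
# Map every original value to the suggested representative of its cluster.
replacements = {v: suggest(members) for members in clusters.values() for v in members}
cleaned = [replacements[v] for v in values]
print(cleaned)
```

Note that this method only merges values that differ in case, punctuation, accents, or word order. Values such as "Optimus" and "Optimus Prime" produce different keys, so grouping them would need a fuzzier method.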
Optimus offers several string-clustering methods, ranging from fast but less accurate ones such as fingerprinting to more...