Identifying and removing duplicate data
One problem when cleaning up data is dealing with duplicates. How do we find them? What do we do with them once we have found them? While part of this process can be automated, merging duplicated data is often a manual task, because a person has to look at potential matches, determine whether they are duplicates, and decide what needs to be done with the overlapping data. We can code heuristics, of course, but at some point, a person needs to make the final call.
The first question that needs to be answered is what constitutes identity for the data. If you have two items of data, which fields do you have to look at in order to determine whether they are duplicates? Then, you must determine how closely those fields need to match for the two items to count as the same.
For this recipe, we'll examine some data and decide on duplicates by doing a fuzzy comparison of the name fields. We'll simply return all of the pairs that appear to be duplicates.
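To make the idea concrete before the setup, here is a minimal sketch in Python of a fuzzy comparison over name fields. It is not the recipe's own implementation. The function names, the `name` key, the sample records, and the 0.85 threshold are all assumptions chosen for illustration, and the similarity measure is the standard library's `difflib.SequenceMatcher` ratio rather than any particular metric the recipe may use.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the normalized names are identical."""
    # Normalize case and surrounding whitespace so trivial differences
    # do not lower the score.
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def find_duplicate_pairs(records, key="name", threshold=0.85):
    """Return every pair of records whose `key` fields look like duplicates.

    `threshold` is a hypothetical cutoff; tune it against real data.
    """
    pairs = []
    # Compare each unordered pair once. This is quadratic in the number
    # of records, which is acceptable for a small illustration.
    for left, right in combinations(records, 2):
        if name_similarity(left[key], right[key]) >= threshold:
            pairs.append((left, right))
    return pairs


# Hypothetical sample data.
people = [
    {"id": 1, "name": "Jon Smith"},
    {"id": 2, "name": "John Smith"},
    {"id": 3, "name": "Mary Jones"},
]

duplicates = find_duplicate_pairs(people)
for left, right in duplicates:
    print(left["id"], right["id"])
```

The sketch only reports candidate pairs; deciding which record to keep, or how to merge the overlapping fields, is still left to a person.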
Getting ready
First, we need to add the...