Machine learning-based record linkage
Record linkage can be modeled as a machine learning problem and solved in either an unsupervised or a supervised manner. When we have only the features of the tuples we want to de-duplicate and no ground truth information, an unsupervised learning method such as K-means is employed.
Let us look at unsupervised learning first.
Unsupervised learning
Let's start with an unsupervised machine learning technique, K-means clustering. K-means is a well-known and popular clustering algorithm. It is closely related to expectation maximization: it alternates between assigning each point to its nearest cluster center and recomputing each center as the mean of its assigned points. It belongs to the class of iterative descent clustering methods. Internally, it assumes the variables are quantitative and uses squared Euclidean distance as the dissimilarity measure to arrive at the clusters.
K is a parameter to the algorithm. It stands for the number of clusters we need, and the user must provide it.
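To make the roles of K and of Euclidean distance concrete, here is a minimal sketch of K-means in plain Python. It is an illustration only, not the implementation used in this chapter: the function name kmeans, the variable pairs, and the idea that each tuple holds two similarity scores for a candidate record pair are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of numbers) into k groups.

    Hypothetical helper for illustration; k is supplied by the caller.
    """
    rng = random.Random(seed)
    # Initialise the centroids with k distinct points drawn from the data.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        # (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids


# Made-up similarity scores for six candidate record pairs (assumed data).
pairs = [
    (0.95, 0.90), (0.92, 0.97), (0.88, 0.93),  # pairs that look alike
    (0.10, 0.15), (0.05, 0.20), (0.12, 0.08),  # pairs that look different
]
labels, centroids = kmeans(pairs, k=2)
print(labels)
print(centroids)
```

With K set to 2, the two clusters can be read as likely duplicates and likely non-duplicates. The cluster numbers themselves are arbitrary, so in practice we would inspect the centroids to decide which cluster is which.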
Note
Refer to The Elements of Statistical Learning, Chapter 14 for a more...