Identity matching
This section covers an important data preparation topic: identity matching and its solutions. We will discuss some of Spark's features for resolving identity issues, along with some data matching solutions that Spark makes easy.
After this section, we will be able to handle some common data identity problems with Apache Spark.
Identity issues
In data preparation, we often need to deal with data elements that belong to the same person or unit but do not look alike. For example, we may have purchase data for customer Larry Z. and web activity data for L. Zhang. Is Larry Z. the same person as L. Zhang? Are there many identity variations in the data?
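To make the Larry Z. and L. Zhang question concrete, here is a minimal sketch in plain Python (not Spark code) of one possible rule: two names are treated as candidate matches when every token is either identical or one is an initial of the other. The function names `name_tokens`, `tokens_compatible`, and `may_be_same_person` are hypothetical and introduced only for illustration. The rule itself is an assumption, not a method prescribed by this section.

```python
import re


def name_tokens(name):
    """Lowercase a name and split it into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", name.lower())


def tokens_compatible(a, b):
    """Two tokens are compatible if they are equal or one is an initial of the other."""
    if a == b:
        return True
    if len(a) == 1:
        return b.startswith(a)
    if len(b) == 1:
        return a.startswith(b)
    return False


def may_be_same_person(name_a, name_b):
    """Return True if the two names are candidate matches under the initial rule.

    Assumption: both names list their parts in the same order and have the
    same number of parts. A True result only flags a candidate pair; it does
    not prove that the records belong to the same person.
    """
    tokens_a = name_tokens(name_a)
    tokens_b = name_tokens(name_b)
    if not tokens_a or len(tokens_a) != len(tokens_b):
        return False
    return all(tokens_compatible(x, y) for x, y in zip(tokens_a, tokens_b))


if __name__ == "__main__":
    print(may_be_same_person("Larry Z.", "L. Zhang"))
    print(may_be_same_person("Larry Z.", "M. Zhang"))
```

A rule this loose will also pair unrelated people who happen to share initials, so in practice a candidate match like this would normally be confirmed against other fields, such as an email address or a postal address.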
Matching entities is a major challenge in data preparation for machine learning because these kinds of entity variation are very common and can have many different causes, such as duplication, errors, name variants, and intentional aliasing. Sometimes, it...