Deduplicating records
When you start analyzing business data, you may find that it is inconsistent: the same record can appear several times in different notations.
The following example table contains duplicates:
Figure 6.3 – Customer table with duplicates
As you may have noticed, the preceding table contains only four unique records. Two of them appear twice in different notations, which causes duplication. If you analyze data that contains such duplicates, the results can be biased and therefore incorrect.
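To make the bias concrete, here is a minimal Python sketch. The rows are hypothetical values invented for illustration and are not the contents of the customer table in the figure; the `spend` column and the `normalize_key` and `deduplicate` helpers are likewise assumptions. The sketch uses a simple rule (a trimmed, lowercased email address) as the matching key, which is only a stand-in for real record matching and handles far fewer cases than an ML-based approach would.

```python
# Hypothetical customer rows: six rows, but only four distinct customers.
# Two customers appear twice, written in different notations.
customers = [
    {"name": "Jane Doe", "email": "jane.doe@example.com", "spend": 100.0},
    {"name": "Doe, Jane", "email": "JANE.DOE@example.com", "spend": 100.0},
    {"name": "John Smith", "email": "john.smith@example.com", "spend": 40.0},
    {"name": "J. Smith", "email": "john.smith@example.com ", "spend": 40.0},
    {"name": "Ana Lopez", "email": "ana.lopez@example.com", "spend": 20.0},
    {"name": "Wei Chen", "email": "wei.chen@example.com", "spend": 40.0},
]


def normalize_key(row):
    """Build a matching key: the email, trimmed and lowercased (an assumed rule)."""
    return row["email"].strip().lower()


def deduplicate(rows):
    """Keep the first row seen for each matching key."""
    unique = {}
    for row in rows:
        unique.setdefault(normalize_key(row), row)
    return list(unique.values())


deduped = deduplicate(customers)

# Totals computed with and without the duplicated rows.
naive_total = sum(row["spend"] for row in customers)
deduped_total = sum(row["spend"] for row in deduped)

print(f"rows: {len(customers)}, total spend: {naive_total}")
print(f"unique customers: {len(deduped)}, total spend: {deduped_total}")
```

In this sketch, the duplicated rows inflate both the customer count and the total spend, so any metric derived from them is skewed. A fixed rule like this breaks down as soon as the duplicates differ in fields that cannot be normalized mechanically, such as misspelled names or reformatted addresses.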
With AWS Glue, you can use the FindMatches transform to find duplicated records. FindMatches is one of the ETL transforms provided in the Glue ETL library. It uses a machine learning (ML) model to match records, so you can identify and remove duplicates.
Let’s look at the end-to-end matching process:
- Register a table definition for your data in AWS Glue Data...