Identifying and removing duplicate cases
Datasets may contain duplicate records that often must be removed before data mining can begin. For example, the same individual may appear multiple times in a dataset with different addresses. The Distinct node, located in the Record Ops palette, checks for duplicate records and identifies the cases that appear more than once in a file so they can be reviewed and/or removed.
A duplicate case is one that has identical data values on one or more specified fields. Any number or combination of fields may be used to define a duplicate:
- Place a Distinct node from the Record Ops palette onto the canvas.
- Connect the Sort node to the Distinct node.
- Edit the Distinct node.
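To make the definition of a duplicate concrete, the idea can be sketched outside Modeler in a few lines of Python. This is only an illustration of the concept, not how the Distinct node is implemented; the sample records, the field names, and the `split_duplicates` helper are all hypothetical.

```python
# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "name": "Ann Lee", "address": "12 Oak St"},
    {"id": 2, "name": "Bob Ray", "address": "4 Elm Ave"},
    {"id": 1, "name": "Ann Lee", "address": "98 Pine Rd"},
]


def split_duplicates(rows, key_fields):
    """Split rows into first occurrences and later duplicates.

    Two rows count as duplicates when they hold identical values
    on every field named in key_fields.
    """
    seen = set()
    firsts, duplicates = [], []
    for row in rows:
        # Build the comparison key from the chosen fields only.
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            firsts.append(row)
    return firsts, duplicates


# Same person, different address: a duplicate on id and name.
firsts, duplicates = split_duplicates(records, ["id", "name"])

# Adding address to the key makes every record unique.
firsts_all, duplicates_all = split_duplicates(records, ["id", "name", "address"])
```

The choice of key fields decides what counts as a duplicate: keyed on `id` and `name`, the second "Ann Lee" record is flagged, but once `address` is included no record is. In this sketch the record kept as the first occurrence depends purely on row order.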
The Distinct node can be a bit tricky to use, so we will run it a couple of times to make its options clear. The Mode option controls how the Distinct node is...