Removing duplicate cases
We sometimes end up with duplicate cases in our datasets and want to retain only one copy of each.
Getting ready
Create a sample data frame:
> salary <- c(20000, 30000, 25000, 40000, 30000, 34000, 30000)
> family.size <- c(4,3,2,2,3,4,3)
> car <- c("Luxury", "Compact", "Midsize", "Luxury", "Compact", "Compact", "Compact")
> prospect <- data.frame(salary, family.size, car)
How to do it...
The unique() function can do the job. It takes a vector or data frame as an argument and returns an object of the same type as its argument, but with duplicates removed.
Remove duplicates to get unique values:
> prospect.cleaned <- unique(prospect)
> nrow(prospect)
[1] 7
> nrow(prospect.cleaned)
[1] 5
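Since unique() also accepts a plain vector, a minimal sketch of the vector case may help. It reuses the family.size values from this recipe; the name family.size.unique is introduced here purely for illustration.

```r
# Same values as the family.size column in the recipe
family.size <- c(4, 3, 2, 2, 3, 4, 3)

# unique() keeps the first occurrence of each value, in order of appearance
family.size.unique <- unique(family.size)
print(family.size.unique)
# [1] 4 3 2
```

Note that the result is not sorted; the values appear in the order in which they were first seen.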
How it works...
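The unique() function takes a vector or data frame as an argument and returns a similar object with the duplicates eliminated. For a data frame, it compares entire rows, so two cases count as duplicates only if they match in every column. Cases that occur once are returned as is. For repeated cases, unique() keeps only the first occurrence, so the result contains exactly one copy of each distinct case.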
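One way to check this behavior is to compare unique() with the rows that duplicated() does not flag. The following is a sketch under the assumption that the sample data frame is rebuilt as in the Getting ready section; prospect.first is a name made up for this illustration.

```r
# Rebuild the sample data frame from the recipe
salary <- c(20000, 30000, 25000, 40000, 30000, 34000, 30000)
family.size <- c(4, 3, 2, 2, 3, 4, 3)
car <- c("Luxury", "Compact", "Midsize", "Luxury", "Compact", "Compact", "Compact")
prospect <- data.frame(salary, family.size, car)

# Keep every row that is not a repeat of an earlier row
prospect.first <- prospect[!duplicated(prospect), ]

# The surviving rows keep their original row names (5 and 7 are dropped)
print(rownames(prospect.first))

# Compare with the result of unique()
print(identical(unique(prospect), prospect.first))
```

The row names show which cases were retained: the first occurrence of the repeated case (row 2) stays, while the later copies are removed.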
There's more...
Sometimes we just want to identify the duplicated values without necessarily removing them.
Identifying duplicates without deleting them
For this, use the duplicated() function:
> duplicated(prospect)
[1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
From the data, we know that cases 2, 5, and 7 are identical. Note, however, that only cases 5 and 7 are flagged. The duplicated() function marks a case as a duplicate only if an identical case appears earlier in the data, so case 2, being the first occurrence, is not flagged.
To list the duplicate cases, use the following code:
> prospect[duplicated(prospect), ]
  salary family.size     car
5  30000           3 Compact
7  30000           3 Compact
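If we want to see every copy of a repeated case, including the first occurrence, one possible approach is to combine a forward scan with a backward scan using the fromLast argument of duplicated(). This is a sketch rather than part of the original recipe; the name all.copies is hypothetical.

```r
# Rebuild the sample data frame from the recipe
salary <- c(20000, 30000, 25000, 40000, 30000, 34000, 30000)
family.size <- c(4, 3, 2, 2, 3, 4, 3)
car <- c("Luxury", "Compact", "Midsize", "Luxury", "Compact", "Compact", "Compact")
prospect <- data.frame(salary, family.size, car)

# Flag a row if it repeats an earlier row OR is repeated by a later row
all.copies <- duplicated(prospect) | duplicated(prospect, fromLast = TRUE)

# List every case that has at least one identical twin
print(prospect[all.copies, ])

# Row positions of all copies
print(which(all.copies))
# [1] 2 5 7
```

The backward scan (fromLast = TRUE) flags cases 2 and 5, the forward scan flags cases 5 and 7, and combining the two with | covers all three.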