Managing duplicate and redundant data
In statistics and data analytics, we are often told that more data is better, but that isn’t always true. If the data is duplicated or redundant, it can skew or bias your results, or even invalidate your analysis entirely. Here, we will discuss the different ways you can have too much data, how each affects your results, and what you can do about it.
Duplicate data
Duplicate data occurs when the same data point appears more than once in a dataset. In a spreadsheet, this means there are multiple rows with completely identical values. In the following table, we see a simple example of duplicate data:
Employee ID | LastName | FirstName | Department | Years With Company
83784 | Benhill | Floyd ... |
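To make the idea of fully identical rows concrete, here is a minimal Python sketch of how we might detect and remove them. It uses only the standard library, and it assumes each row can be represented as a tuple of values. The rows, the column order, and the helper names `find_duplicates` and `drop_duplicates` are hypothetical and invented for illustration; they are not the values from the table above.

```python
from collections import Counter

# Hypothetical rows: (Employee ID, LastName, FirstName, Department, Years With Company).
# Every value here is invented for illustration.
rows = [
    (10001, "Ortiz", "Dana", "Finance", 4),
    (10002, "Nguyen", "Sam", "Marketing", 2),
    (10001, "Ortiz", "Dana", "Finance", 4),  # exact repeat of the first row
    (10003, "Okafor", "Lee", "Finance", 7),
]


def find_duplicates(rows):
    """Return a dict mapping each fully repeated row to how many times it occurs."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    # dict keys are unique and keep insertion order, so this removes repeats.
    return list(dict.fromkeys(rows))


duplicates = find_duplicates(rows)
unique_rows = drop_duplicates(rows)

print(duplicates)
print(len(rows), "rows before,", len(unique_rows), "rows after")
```

A row counts as a duplicate here only if every value matches, which mirrors the definition above. If you work in pandas, `DataFrame.duplicated()` and `DataFrame.drop_duplicates()` offer the same kind of check on a DataFrame.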