Ensuring that the data is complete and not missing
Now that we have achieved data consistency and uniqueness, it’s time to identify and address other quality issues. One such issue is missing or incomplete data, a common problem with real datasets. As a dataset grows, so does the chance that some of its data points are missing. Missing data can arise in several ways, including:
- Merging of source datasets: For example, when we match records on date of birth or postcode to enrich data, and that key is missing or inaccurate in one of the datasets, the unmatched records take NA values in the enriched fields.
- Random events: This is quite common in surveys, where respondents may not realize that a question is compulsory, or may not know the answer.
- Failures of measurement: For example, some traits, such as blood pressure, are known to have a very substantial component of random...
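To make the merge case concrete, here is a minimal sketch in plain Python rather than a definitive implementation. The customer records, the postcode-to-region lookup, and the helper names `enrich` and `missing_counts` are all hypothetical, and `None` stands in for NA. The sketch shows that both a missing key and a key with no match in the lookup leave a gap in the enriched field.

```python
# Hypothetical records; all names and values are made up for illustration.
customers = [
    {"id": 1, "postcode": "AB1 2CD"},
    {"id": 2, "postcode": None},       # key is missing in the source data
    {"id": 3, "postcode": "ZZ9 9ZZ"},  # key has no match in the lookup
]
regions = {"AB1 2CD": "North"}  # hypothetical postcode -> region lookup


def enrich(records, lookup):
    """Left join: keep every record; unmatched ones get region=None."""
    return [{**r, "region": lookup.get(r["postcode"])} for r in records]


def missing_counts(records):
    """Count None values per field across all records."""
    counts = {}
    for record in records:
        for field, value in record.items():
            counts[field] = counts.get(field, 0) + (value is None)
    return counts


merged = enrich(customers, regions)
print(missing_counts(merged))
```

Counting missing values per field after every merge is a cheap way to see how much incompleteness the join itself introduced, as opposed to what was already missing in the source.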