The six key dimensions of data quality
There are six key dimensions we can use to check the overall health of data. Keeping data healthy across these dimensions lets us build reliable systems and make better decisions. For example, if 20% of survey data is duplicated, and most of the duplicates come from male candidates, then decisions based on that data will likely favor male candidates if the duplication goes undetected. Hence, it’s important to understand the overall health of the data to make reliable and unbiased decisions. To measure data quality, we can break it down into the following dimensions:
- Consistency: This refers to whether the same value is represented in the same way across all rows of a given column or feature. For example, is the gender label for males consistent? The label can take values of “1,” “Male,” “M”...
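
The consistency check described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the sample column `gender_column`, the `CANONICAL` mapping, and the helper names `label_variants` and `normalize` are all assumptions made for this example. It counts how many distinct raw labels appear in a column, then maps them to one canonical label so the same category is always written the same way.

```python
from collections import Counter

# Hypothetical survey column: the same category recorded in several ways.
gender_column = ["Male", "M", "1", "Female", "F", "male", "Male"]

# Assumed mapping from raw variants to one canonical label (illustrative only).
CANONICAL = {
    "male": "Male",
    "m": "Male",
    "1": "Male",
    "female": "Female",
    "f": "Female",
}


def label_variants(values):
    """Count how often each distinct raw label appears."""
    return Counter(values)


def normalize(values, mapping):
    """Map each raw label to its canonical form; unknown labels become None."""
    return [mapping.get(str(v).strip().lower()) for v in values]


raw_counts = label_variants(gender_column)
clean_counts = Counter(normalize(gender_column, CANONICAL))

print("Distinct raw labels:", len(raw_counts))
print("Counts after normalization:", dict(clean_counts))
```

A large number of distinct raw labels relative to the number of real categories is a signal that the column is inconsistent. Labels that normalize to `None` are ones the mapping does not cover and would need manual review.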