Data Profiling and Data Quality
When we work with multiple data sources, bad data can easily pass through if no checks are in place. This can cause serious issues in downstream systems that rely on the accuracy of upstream data to build models, run business-critical applications, and so on. To make our data pipelines resilient, we need data quality checks that ensure the data being processed meets the requirements of both the business and downstream applications.
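As a minimal sketch of what such a check might look like, the snippet below separates records that have every required field filled in from those that do not, so that only the former move downstream. The field names, the sample records, and the `split_by_quality` helper are hypothetical and chosen purely for illustration; a real pipeline would derive its rules from its own schema and business requirements.

```python
from typing import Iterable

# Hypothetical required fields; a real pipeline would take these from its own schema.
REQUIRED_FIELDS = ("customer_id", "email")


def split_by_quality(records: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Separate records that pass the check from those that fail it."""
    passed, rejected = [], []
    for record in records:
        # A field counts as missing if it is absent, None, or an empty string.
        missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
        (rejected if missing else passed).append(record)
    return passed, rejected


# Illustrative sample data, not taken from any real system.
records = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": ""},
    {"customer_id": None, "email": "c@example.com"},
]
passed, rejected = split_by_quality(records)
print(len(passed), len(rejected))
```

Keeping the rejected records instead of silently dropping them is a deliberate choice in this sketch: it lets them be inspected or quarantined later.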
Data quality can be measured along six primary dimensions, each of which can be assessed individually and used to improve overall quality:
- Completeness: Does the customer dataset you plan to use for an upcoming marketing campaign have all of the required attributes filled in?
- Accuracy: Are the email addresses and phone numbers in your customer records accurate?
- Consistency: Is customer data consistent across systems?
- Validity: Do your customer records have valid...