Setting up a serverless data quality pipeline with Athena
Data quality validation is an important step in data wrangling pipelines because it helps ensure the accuracy of the data used in analysis and visualization. In this section, we will explore how to perform data quality validation with Amazon Athena.
Implementing data quality rules in Athena
Let us consider the rules that we want to validate in the NOAA weather dataset. What follows is only a high-level representation of some data quality rules and not a comprehensive ruleset for the weather dataset:
- The state column should have two-character values when the country code is US.
- The date field shouldn’t have any future-dated values, since those would be incorrect measurements.
- The element column should contain only values from the accepted list provided in the documentation.
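As a rough illustration, each of these rules can be phrased as a SQL query that counts the rows violating it. The following Python sketch only builds the query strings and does not connect to Athena. The table name, the `country_code` and `obs_date` column names, and the short list of accepted elements are assumptions made for this example and should be replaced with the names and values from your own table and the dataset documentation. It also assumes that the date column is stored as a DATE type.

```python
# Hypothetical names for illustration only; adjust them to your own catalog.
TABLE = "weather_db.noaa_observations"
# Illustrative subset of element codes, not the full documented list.
ACCEPTED_ELEMENTS = ("PRCP", "SNOW", "SNWD", "TMAX", "TMIN")


def build_rule_queries(table=TABLE, elements=ACCEPTED_ELEMENTS):
    """Return one violation-count query per data quality rule."""
    # Render the accepted values as a quoted, comma-separated SQL list.
    element_list = ", ".join(f"'{e}'" for e in elements)
    select = f"SELECT COUNT(*) AS violations FROM {table} "
    return {
        # Rule 1: US rows must have a two-character state value.
        "us_state_two_chars": select
        + "WHERE country_code = 'US' "
        + "AND (state IS NULL OR length(state) <> 2)",
        # Rule 2: no observation may be dated in the future
        # (assumes obs_date is a DATE column).
        "no_future_dates": select + "WHERE obs_date > current_date",
        # Rule 3: element must be one of the accepted values.
        "element_in_accepted_list": select
        + f"WHERE element NOT IN ({element_list})",
    }


if __name__ == "__main__":
    for rule_name, sql in build_rule_queries().items():
        print(f"-- {rule_name}\n{sql};\n")
```

A result of zero violations means the rule passes. The generated statements could be pasted into the Athena query editor or submitted programmatically, for example with the boto3 Athena client's `start_query_execution` call.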
We can have more rules that will ensure better data quality, but the preceding rules are sufficient for us to demonstrate...