Using SQL operators for data quality
Good data quality is crucial to the effectiveness of an organization's data systems. By performing quality checks within the DAG, it is possible to stop pipelines and notify stakeholders before erroneous data reaches a production lake or warehouse.
Although plenty of tools on the market provide data quality checks, one of the most popular ways to run them is with SQL queries. As you may have already guessed, Airflow has providers that support those operations.
This recipe covers the principal data quality topics in the data ingestion process and points out the best SQLOperator type to run in each situation.
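To make the idea of a SQL-based quality check concrete before moving to Airflow, here is a minimal sketch that uses only Python's built-in sqlite3 module as a stand-in for a real warehouse. It is not the Airflow operator API: the run_check helper, the DataQualityError exception, and the customer_id and email columns are all hypothetical names chosen for illustration. Each check is a query that counts offending rows, and a non-zero count raises an error, which is the point at which a pipeline would stop and notify stakeholders.

```python
import sqlite3


class DataQualityError(Exception):
    """Raised when a check fails; in a pipeline this would stop the run."""


def run_check(conn, name, sql):
    """Run a query returning one count of offending rows; fail if non-zero."""
    (bad_rows,) = conn.execute(sql).fetchone()
    if bad_rows:
        raise DataQualityError(f"{name}: {bad_rows} offending row(s)")
    return True


# Hypothetical in-memory table with two deliberate problems:
# a NULL email and a duplicated customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "b@example.com")],
)

# Each check counts the rows that violate a rule.
CHECKS = {
    "email_not_null": "SELECT COUNT(*) FROM customers WHERE email IS NULL",
    "customer_id_unique": (
        "SELECT COUNT(*) - COUNT(DISTINCT customer_id) FROM customers"
    ),
}

failures = []
for name, sql in CHECKS.items():
    try:
        run_check(conn, name, sql)
    except DataQualityError as exc:
        failures.append(str(exc))

print(failures)
```

With the sample rows above, both checks fail, so failures holds one message per check. In an Airflow DAG, the same queries would typically be handed to a SQL operator, and a failed check would fail the task instead of appending to a list.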
Getting ready
Before starting our exercise, let's create a simple Entity Relationship Diagram (ERD) for a customers table. Here is how it looks:
Figure 10.40 – An example of customers table columns
And the same table is represented with...