Improving data integrity with DLT
In the last chapter, we introduced DLT as a helpful tool for streaming data and pipeline development. Here, we focus on how to use DLT as your go-to tool for actively tracking data quality. Real-world datasets are dynamic and rarely as neat and tidy as the ones used in school and training. You can clean data with code, of course, but DLT has a feature that makes the cleaning process even easier: expectations. Expectations catch data quality issues by automatically validating that incoming data passes the rules and quality checks you specify. For example, you might expect your customer data to have positive values for age, or dates that follow a specific format. Data that does not meet these expectations can negatively impact downstream data pipelines; with expectations implemented, you control how invalid records are handled before they cause those problems.
Implementing expectations gives us more control over data quality, alerting us to unusual data requiring attention and action. There are several options when dealing with erroneous data in DLT:
- First, we can set a warning, which reports the number of records that failed an expectation as a metric but still writes those invalid records to the destination dataset; see @dlt.expect() in Figure 4.1 for an example.
- Second, we can drop invalid records so that the final dataset only contains records that meet our expectations; see @dlt.expect_or_drop() in Figure 4.1.
- Third, we can fail the operation entirely so that nothing new is written (note that this option requires manually re-triggering the pipeline).
- Finally, we can quarantine the invalid data in a separate table to investigate it further.

The following code should look familiar from the DLT code in Chapter 3, but now with the addition of expectations.
Figure 4.1 – Using DLT expectations to enforce data quality
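Since the figure reproduces the pipeline code as an image, here is a minimal sketch of what each of the four options can look like in DLT’s Python API. The table names, column names, constraints, and expectation names below are illustrative placeholders, not taken from the project:

```python
import dlt

# Warn: records that fail the expectation are still written to the target,
# but the number of failures is reported as a data quality metric.
@dlt.table(comment="Orders with expectations applied (illustrative example)")
@dlt.expect("positive_amount", "amount > 0")
# Drop: records that fail the expectation are removed from the output dataset.
@dlt.expect_or_drop("valid_order_date", "order_date IS NOT NULL")
# Fail: the pipeline update stops entirely if any record violates the expectation.
@dlt.expect_or_fail("valid_order_id", "order_id IS NOT NULL")
def orders_clean():
    return dlt.read_stream("orders_raw")  # hypothetical upstream table


# Quarantine: route only the records that violate the rules into a separate
# table so they can be investigated later.
@dlt.table(comment="Orders that failed validation (illustrative example)")
@dlt.expect_or_drop("failed_validation_only",
                    "NOT (amount > 0 AND order_date IS NOT NULL)")
def orders_quarantine():
    return dlt.read_stream("orders_raw")  # hypothetical upstream table
```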
Let’s look at our streaming transactions project as an example. In the Applying our learning section in Chapter 3, we used DLT to write the transaction data to a table. Using the same DLT code, we will save ourselves the manual effort of cleaning the CustomerID column by adding an expectation to the original code to drop any records with a null CustomerID. We will set another expectation to warn us if the Product field is null.
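In outline, and hedging because the exact table definition lives in the Chapter 3 notebook and is shown in Figure 4.1, the updated code might look something like the following sketch. The generate_table() wrapper mirrors the function name from Chapter 3, but its signature, the source table name, and the Product expectation name are assumptions made for illustration:

```python
import dlt

# Sketch only -- see Figure 4.1 and the Chapter 3 notebook for the actual code.
def generate_table():
    @dlt.table(name="synthetic_transactions_silver")
    # Drop any record whose CustomerID is null.
    @dlt.expect_or_drop("valid_CustomerID", "CustomerID IS NOT NULL")
    # Warn about (but still write) records whose Product is null;
    # the expectation name here is hypothetical.
    @dlt.expect("valid_Product", "Product IS NOT NULL")
    def synthetic_transactions_silver():
        # Placeholder source: whatever bronze table the Chapter 3 code reads from.
        return dlt.read_stream("synthetic_transactions_bronze")

generate_table()
```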
Now, when we call generate_table(), the DLT pipeline will automatically clean up our table by dropping any null CustomerID values and flagging records without a Product value. Moreover, DLT will automatically build helpful visualizations that let us immediately investigate the data’s quality.
To try this yourself, update the DLT code from Chapter 3 (here’s the path to the notebook as a reminder: Chapter 3: Building Out Our Bronze Layer/Project: Streaming Transactions/delta_live_tables/) to match Figure 4.1, and then rerun the DLT pipeline as you did before. Once the pipeline run is complete, it generates the DAG. Click on the synthetic_transactions_silver table, then open the Data Quality tab in the table details. This will display information about the records processed, such as how many were written versus dropped for failing a given expectation, as shown in Figure 4.2.
Figure 4.2 – The DLT data quality visualizations
These insights illustrate how expectations help automatically clean up our tables and flag information that might be useful for data scientists using this table downstream. In this example, we see that all records passed the valid_CustomerID expectation, so we know we don’t have to worry about null customer IDs in the table. Additionally, almost 80% of records are missing a Product value, which may be relevant for data science and machine learning (ML) projects that use this data.
Just as we’ve considered the correctness and consistency of incoming data, we also want to expand our data quality oversight to cover data drift, that is, changes in the data’s distribution over time. Observing data drift is where Databricks Lakehouse Monitoring emerges as a vital complement to DLT, offering a configurable framework for consistently observing and verifying the statistical properties and quality of input data.