Databricks ML in Action

Learn how Databricks supports the entire ML lifecycle end to end, from data ingestion to model deployment.

Product type: Paperback
Published: May 2024
Publisher: Packt
ISBN-13: 9781800564893
Length: 280 pages
Edition: 1st
Authors (4): Hayley Horn, Amanda Baker, Anastasia Prokaieva, Stephanie Rivera
Table of Contents (13 chapters)

Preface
Part 1: Overview of the Databricks Unified Data Intelligence Platform
Chapter 1: Getting Started and Lakehouse Concepts
Chapter 2: Designing Databricks: Day One
Chapter 3: Building the Bronze Layer
Part 2: Heavily Project Focused
Chapter 4: Getting to Know Your Data
Chapter 5: Feature Engineering on Databricks
Chapter 6: Tools for Model Training and Experimenting
Chapter 7: Productionizing ML on Databricks
Chapter 8: Monitoring, Evaluating, and More
Index
Other Books You May Enjoy

Improving data integrity with DLT

In the last chapter, we introduced DLT as a helpful tool for streaming data and pipeline development. Here, we focus on using DLT as your go-to tool for actively tracking data quality. Real-world datasets are dynamic, not neat and tidy like the ones you often see in school and training. You can, of course, clean data with code, but DLT has a feature that makes the cleaning process even easier: expectations. Expectations catch data quality issues by automatically validating that incoming data passes specified rules and quality checks. For example, you might expect your customer data to have positive values for age, or dates that follow a specific format. Data that does not meet these expectations can negatively impact downstream data pipelines; with expectations implemented, you ensure that your pipelines won’t suffer.

Implementing expectations gives us more control over data quality, alerting us to unusual data requiring attention and action. There are several options when dealing with erroneous data in DLT:

  1. First, we can set a warning, which will report the number of records that failed an expectation as a metric but still write those invalid records to the destination dataset; see @dlt.expect() in Figure 4.1 for an example.
  2. Second, we can drop invalid records so that the final dataset only contains records that meet our expectations; see @dlt.expect_or_drop() in Figure 4.1.
  3. Third, we can fail the operation entirely, so nothing new is written (note that this option requires manual re-triggering of the pipeline).
  4. Finally, we can quarantine the invalid data in a separate table to investigate it further. The code in Figure 4.1, sketched below, should look familiar from the DLT code in Chapter 3, but now with expectations added.
Figure 4.1 – Using DLT expectations to enforce data quality
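
Figure 4.1 itself is not reproduced here; as a rough guide, a minimal sketch of the pattern it shows might look like the following. It assumes the generate_table() wrapper used in the Chapter 3 project, the valid_CustomerID and valid_Product expectation names referenced later in this section, and an illustrative bronze source table name.

import dlt

def generate_table():
    @dlt.table(name="synthetic_transactions_silver")
    @dlt.expect("valid_Product", "Product IS NOT NULL")                # warn: failures are counted, rows are kept
    @dlt.expect_or_drop("valid_CustomerID", "CustomerID IS NOT NULL")  # rows with a null CustomerID are dropped
    def synthetic_transactions_silver():
        # Read the bronze stream produced in Chapter 3 (the source name is illustrative)
        return dlt.read_stream("synthetic_transactions_bronze")

generate_table()

Swapping @dlt.expect_or_drop for @dlt.expect_or_fail would instead halt the pipeline on the first violation, matching the third option in the list above.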

Let’s look at our streaming transactions project as an example. In the Applying our learning section of Chapter 3, we used DLT to write the transaction data to a table. By reusing that DLT code and adding an expectation that drops any record with a null CustomerID, we save ourselves the manual effort of cleaning the CustomerID column. We set another expectation to warn us when the Product field is null.

Now, when we call generate_table(), the DLT pipeline automatically cleans up our table by dropping any records with null CustomerID values and flagging records without a Product value. Moreover, DLT automatically builds helpful visualizations so we can immediately investigate the data’s quality.

To try this yourself, update the DLT code from Chapter 3 (here’s the path to the notebook as a reminder: Chapter 3: Building Out Our Bronze Layer/Project: Streaming Transactions/delta_live_tables/) to match Figure 4.1, and then rerun the DLT pipeline as you did before. Once the pipeline is complete, it generates the DAG. Click on the synthetic_transactions_silver table, then click the Data Quality tab from the table details. This will display information about the records processed, such as how many were written versus dropped for failing a given expectation, as shown in Figure 4.2.

Figure 4.2 – The DLT data quality visualizations

These insights illustrate how expectations help automatically clean up our tables and flag information that might be useful for data scientists using this table downstream. In this example, we see that all records passed the valid_CustomerID expectation, so now we know we don’t have to worry about null customer IDs in the table. Additionally, almost 80% of records are missing a Product value, which may be relevant for data science and machine learning (ML) projects that use this data.

Just as we’ve considered the correctness and consistency of incoming data, we also want to expand our data quality oversight to cover data drift, that is, changes in your data’s distribution over time. Observing data drift is where Databricks Lakehouse Monitoring emerges as a vital complement to DLT, offering a configurable framework to consistently observe and verify the statistical properties and quality of input data.
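
As a preview, a minimal sketch of attaching a monitor to the silver table might look like the following. It assumes the databricks-lakehouse-monitoring client library’s create_monitor call with a snapshot profile; the fully qualified table name and output schema are illustrative placeholders, not names from this project.

from databricks import lakehouse_monitoring as lm

lm.create_monitor(
    table_name="lakehouse_in_action.synthetic_transactions.synthetic_transactions_silver",  # illustrative name
    profile_type=lm.Snapshot(),  # profile summary statistics of the table on each refresh
    output_schema_name="lakehouse_in_action.monitoring",  # illustrative schema for the generated metric tables
)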
