Databricks ML in Action

Learn how Databricks supports the entire ML lifecycle end to end, from data ingestion to model deployment.

Product type: Paperback
Published: May 2024
Publisher: Packt
ISBN-13: 9781800564893
Length: 280 pages
Edition: 1st
Authors (4): Hayley Horn, Amanda Baker, Anastasia Prokaieva, Stephanie Rivera
Table of Contents (13 chapters)

Preface
Part 1: Overview of the Databricks Unified Data Intelligence Platform
Chapter 1: Getting Started and Lakehouse Concepts
Chapter 2: Designing Databricks: Day One
Chapter 3: Building the Bronze Layer
Part 2: Heavily Project Focused
Chapter 4: Getting to Know Your Data
Chapter 5: Feature Engineering on Databricks
Chapter 6: Tools for Model Training and Experimenting
Chapter 7: Productionizing ML on Databricks
Chapter 8: Monitoring, Evaluating, and More
Index
Other Books You May Enjoy

Improving data integrity with DLT

In the last chapter, we introduced DLT as a helpful tool for streaming data and pipeline development. Here, we focus on using DLT as your go-to tool for actively tracking data quality. Real-world datasets are dynamic, not neat and tidy like the ones you often see in school and training. You can, of course, clean data with code, but DLT has a feature that makes the cleaning process even easier: expectations. Expectations catch data quality issues by automatically validating that incoming data passes specified rules and quality checks. For example, you might expect your customer data to have positive values for age, or dates that follow a specific format. Data that does not meet these expectations can negatively impact downstream data pipelines; with expectations implemented, you ensure that your pipelines won’t suffer.

Implementing expectations gives us more control over data quality, alerting us to unusual data requiring attention and action. There are several options when dealing with erroneous data in DLT:

  1. First, we can set a warning, which will report the number of records that failed an expectation as a metric but still write those invalid records to the destination dataset; see @dlt.expect() in Figure 4.1 for an example.
  2. Second, we can drop invalid records so that the final dataset only contains records that meet our expectations; see @dlt.expect_or_drop() in Figure 4.1.
  3. Third, we can fail the operation entirely, so nothing new is written (note that this option requires manual re-triggering of the pipeline).
  4. Finally, we can quarantine the invalid data in a separate table to investigate it further. The code in Figure 4.1, sketched below, should look familiar from the DLT code in Chapter 3, but now with expectations added.
Figure 4.1 – Using DLT expectations to enforce data quality
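
Figure 4.1 itself is not reproduced here; as a rough guide, a minimal sketch of the pattern it shows might look like the following. It assumes the generate_table() wrapper used in the Chapter 3 project, the valid_CustomerID and valid_Product expectation names referenced later in this section, and an illustrative bronze source table name.

import dlt

def generate_table():
    @dlt.table(name="synthetic_transactions_silver")
    @dlt.expect("valid_Product", "Product IS NOT NULL")                # warn: failures are counted, rows are kept
    @dlt.expect_or_drop("valid_CustomerID", "CustomerID IS NOT NULL")  # rows with a null CustomerID are dropped
    def synthetic_transactions_silver():
        # Read the bronze stream produced in Chapter 3 (the source name is illustrative)
        return dlt.read_stream("synthetic_transactions_bronze")

generate_table()

Swapping @dlt.expect_or_drop for @dlt.expect_or_fail would instead halt the pipeline on the first violation, matching the third option in the list above.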

Let’s look at our streaming transactions project as an example. In the Applying our learning section of Chapter 3, we used DLT to write the transaction data to a table. By reusing that DLT code and adding an expectation that drops any record with a null CustomerID, we save ourselves the manual effort of cleaning the CustomerID column. We set another expectation to warn us when the Product field is null.

Now, when we call generate_table(), the DLT pipeline automatically cleans up our table by dropping any records with null CustomerID values and flagging records without a Product value. Moreover, DLT automatically builds helpful visualizations so we can immediately investigate the data’s quality.

To try this yourself, update the DLT code from Chapter 3 (here’s the path to the notebook as a reminder: Chapter 3: Building Out Our Bronze Layer/Project: Streaming Transactions/delta_live_tables/) to match Figure 4.1, and then rerun the DLT pipeline as you did before. Once the pipeline is complete, it generates the DAG. Click on the synthetic_transactions_silver table, then click the Data Quality tab from the table details. This will display information about the records processed, such as how many were written versus dropped for failing a given expectation, as shown in Figure 4.2.

Figure 4.2 – The DLT data quality visualizations

These insights illustrate how expectations help automatically clean up our tables and flag information that might be useful for data scientists using this table downstream. In this example, we see that all records passed the valid_CustomerID expectation, so now we know we don’t have to worry about null customer IDs in the table. Additionally, almost 80% of records are missing a Product value, which may be relevant for data science and machine learning (ML) projects that use this data.

Just as we’ve considered the correctness and consistency of incoming data, we also want to expand our data quality oversight to cover data drift, that is, changes in your data’s distribution over time. Observing data drift is where Databricks Lakehouse Monitoring emerges as a vital complement to DLT, offering a configurable framework to consistently observe and verify the statistical properties and quality of input data.
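
As a preview, a minimal sketch of attaching a monitor to the silver table might look like the following. It assumes the databricks-lakehouse-monitoring client library’s create_monitor call with a snapshot profile; the fully qualified table name and output schema are illustrative placeholders, not names from this project.

from databricks import lakehouse_monitoring as lm

lm.create_monitor(
    table_name="lakehouse_in_action.synthetic_transactions.synthetic_transactions_silver",  # illustrative name
    profile_type=lm.Snapshot(),  # profile summary statistics of the table on each refresh
    output_schema_name="lakehouse_in_action.monitoring",  # illustrative schema for the generated metric tables
)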
