You're reading from Distributed Data Systems with Azure Databricks Create, deploy, and manage enterprise data pipelines

Product type Paperback

Published in May 2021

Publisher Packt

ISBN-13 9781838647216

Length 414 pages

Edition 1st Edition

Languages

Python

Tools

Azure

Concepts

Data Science

Author (1):

Palacio

Table of Contents (17) Chapters

Preface

1. Section 1: Introducing Databricks

2. Chapter 1: Introduction to Azure Databricks

3. Chapter 2: Creating an Azure Databricks Workspace

4. Section 2: Data Pipelines with Databricks

5. Chapter 3: Creating ETL Operations with Azure Databricks

6. Chapter 4: Delta Lake with Azure Databricks

7. Chapter 5: Introducing Delta Engine

8. Chapter 6: Introducing Structured Streaming

9. Section 3: Machine and Deep Learning with Databricks

10. Chapter 7: Using Python Libraries in Azure Databricks

11. Chapter 8: Databricks Runtime for Machine Learning

12. Chapter 9: Databricks Runtime for Deep Learning

13. Chapter 10: Model Tracking and Tuning in Azure Databricks

14. Chapter 11: Managing and Serving Models with MLflow and MLeap

15. Chapter 12: Distributed Deep Learning in Azure Databricks

16. Other Books You May Enjoy

Using different sources with continuous streams

Streams of data can come from a variety of sources. Structured Streaming provides support for extracting data from sources such as Delta tables, publish/subscribe (pub/sub) systems such as Azure Event Hubs, and more. We will review some of these sources in the next sections to learn how we can connect these streams of data to our jobs running in Azure Databricks.

Using a Delta table as a stream source

As mentioned in the previous chapter, you can use Structured Streaming with Delta Lake through the readStream and writeStream Spark methods. This integration focuses on overcoming issues related to handling and processing small files, managing batch jobs, and detecting new files efficiently.

When a Delta table is used as a data stream source, a streaming query on that table processes the data already present in the table as well as any data that arrives after the stream has started.
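As a rough illustration of the behavior just described, the following minimal sketch wraps the readStream and writeStream calls in two small helpers. The function names and every path shown are hypothetical placeholders chosen for illustration, not names taken from this book. The sketch assumes it runs somewhere a SparkSession is available with Delta Lake support, such as an Azure Databricks notebook, where the spark variable is predefined.

```python
def read_delta_stream(spark, table_path):
    """Return a streaming DataFrame over the Delta table at table_path.

    The stream covers the rows already in the table plus any rows
    appended after the stream has started.
    """
    return spark.readStream.format("delta").load(table_path)


def write_delta_stream(stream_df, output_path, checkpoint_path):
    """Start a streaming query that appends stream_df to a Delta table.

    checkpoint_path is where Structured Streaming stores its progress,
    so a restarted query can resume from where it stopped.
    """
    return (
        stream_df.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .start(output_path)
    )


# Hypothetical usage in a notebook (all paths are placeholders):
# events = read_delta_stream(spark, "/tmp/delta/events")
# query = write_delta_stream(
#     events, "/tmp/delta/events_copy", "/tmp/checkpoints/events_copy"
# )
```

Passing the session in as a parameter keeps the helpers independent of how the session was created, which is one reasonable design choice among several.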

In the next example, we will load both the path...