Using the Structured Streaming API
Structured Streaming is integrated into the PySpark API and embedded in the Spark DataFrame API. It is easy to use when working with streaming data and, in most cases, requires only minimal changes to migrate a computation on static data to a streaming computation. It also provides features for performing windowed aggregations and for setting the parameters of the execution model.
As we have discussed in previous chapters, in Azure Databricks, streams of data are represented as Spark dataframes. We can verify that a dataframe is a stream of data by checking that its isStreaming property is set to true. To operate with Structured Streaming, the steps can be summarized as read, process, and write, as exemplified here:
- We can read streams of data that are being dumped into, for example, an S3 bucket. The following example code shows how we can use the readStream method, specifying that we are reading a comma-separated values (CSV) file.