Visualizing data on streaming data frames
When working with streams of data in Structured Streaming data frames, we can visualize real-time data using the display function. Unlike other visualization functions, it accepts options such as processingTime and checkpointLocation because of the real-time nature of the data. These options control how often the visualization is refreshed and where the progress of the streaming query is stored, and they should always be set in production so that we know exactly which state of the data we are seeing.
In the following code example, we first define a Structured Streaming data frame, and then we use the display function to show the state of the data every 5 seconds of processing time, using a specific checkpoint location:
streaming_df = spark.readStream.format("rate").load()
display(streaming_df.groupBy().count(), processingTime = "5 seconds", checkpointLocation = "<checkpoint-path>")
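The display function is only available inside Databricks notebooks. As a rough point of comparison, the same 5-second trigger and checkpoint location can be set on a plain Structured Streaming query that writes to an in-memory table. The following is a minimal sketch, not the way display is implemented: the function name start_count_query and the table name rate_counts are hypothetical, and an existing SparkSession is assumed to be passed in as spark.

```python
def start_count_query(spark, checkpoint_path, query_name="rate_counts"):
    """Start a streaming row count that is refreshed every 5 seconds.

    `spark` is assumed to be an existing SparkSession; `query_name` is a
    hypothetical name for the in-memory table that holds the result.
    """
    # Same source as in the display example: the built-in "rate" test source.
    streaming_df = spark.readStream.format("rate").load()

    # Global aggregation: a single row containing the running count.
    counts = streaming_df.groupBy().count()

    return (
        counts.writeStream
        .format("memory")                    # keep results in an in-memory table
        .queryName(query_name)               # table name used to query the results
        .outputMode("complete")              # rewrite the full aggregate on each trigger
        .trigger(processingTime="5 seconds") # micro-batch interval
        .option("checkpointLocation", checkpoint_path)  # where query progress is stored
        .start()
    )
```

Calling start_count_query(spark, "<checkpoint-path>") returns a StreamingQuery. While it runs, spark.sql("SELECT * FROM rate_counts").show() prints the latest count, and calling stop() on the returned query ends the stream.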
Specifically...