Handling event time and late data
Event time is the time embedded in the data itself. The older DStream-based Spark Streaming API treated time as the moment an event was received, but for many applications that is not enough. For example, if you want to count how many times a hashtag appears per minute, you need the time at which the data was generated, not the time at which Spark received the event.
The following is an extension of the previous Structured Streaming example, which listens on server port 9999. The timestamp is now included as part of the input data, so we can perform window operations on the unbounded table:
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Create a DataFrame that represents the stream of input lines
// from the connection to host:port, with a timestamp attached to each line
val inputLines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()
...
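The rest of the example is elided above. As a minimal sketch of how the windowed aggregation typically continues (following Spark's standard windowed word-count pattern rather than the exact code of this example), the lines can be split into words while retaining each line's timestamp, and the counts grouped by a sliding event-time window. The window length (10 minutes), slide interval (5 minutes), watermark delay (10 minutes), and the names words, windowedCounts, and query are illustrative assumptions:

// Sketch only: continues from the inputLines DataFrame above and assumes
// `spark` (an active SparkSession) is in scope, as in spark-shell
import spark.implicits._

// Split each line into words, keeping the line's event-time timestamp
val words = inputLines
  .as[(String, Timestamp)]
  .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
  .toDF("word", "timestamp")

// Count words per 10-minute window, sliding every 5 minutes (illustrative values).
// The watermark tells Spark how long to keep waiting for late data before
// finalizing a window and dropping its state (10 minutes here is an assumption).
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word")
  .count()

// Emit the updated counts to the console as new data arrives
val query = windowedCounts.writeStream
  .outputMode("update")
  .format("console")
  .option("truncate", "false")
  .start()

query.awaitTermination()

With a watermark in place, events arriving later than the configured delay behind the latest event time seen so far are dropped, which is how Structured Streaming bounds the state it must keep while still handling moderately late data.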