Streaming ETL
Streaming use cases comprise three main categories of real-time applications – decision engines and alerting apps; BI analytics and tools, such as SQL and search engines; and data science and ML use cases, as highlighted in the following diagram:
In the next section, we will look at the three stages of ETL (Extract, Transform, Load) as it relates to streaming.
Extract – file-based versus event-based streaming
There are two types of stream processing – file-based and event-based. The former applies to data that has landed on disk; the latter applies to data in flight and typically requires a streaming service such as Kafka, Kinesis, or EventHub, from which spark.readStream consumes the data. For example, a Kafka cluster consists of several brokers monitored by ZooKeeper. Data is stored in topics that are broken down into one or more partitions that allow for scalability, fault...
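To make the two extract styles concrete, the following is a minimal sketch of how spark.readStream might be pointed at each kind of source. The helper names, landing path, schema string, broker address, and topic name are illustrative assumptions, not values from this chapter, and the Kafka source additionally assumes that the spark-sql-kafka connector package is available to the Spark session.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only; a live session is passed in by the caller
    from pyspark.sql import DataFrame, SparkSession


def read_file_stream(spark: SparkSession, path: str, schema_ddl: str) -> DataFrame:
    """File-based extract: pick up JSON files as they land in a directory.

    Streaming file sources need an explicit schema; here it is given as a
    DDL string such as "device_id STRING, reading DOUBLE" (hypothetical).
    """
    return (
        spark.readStream
        .format("json")
        .schema(schema_ddl)
        .load(path)
    )


def read_kafka_stream(spark: SparkSession, bootstrap_servers: str, topic: str) -> DataFrame:
    """Event-based extract: subscribe to a Kafka topic.

    The resulting rows carry binary key and value columns, so a later
    transform step would typically cast value to a string and parse it.
    """
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "latest")  # only read events arriving from now on
        .load()
    )


# Example usage (assumed values; requires a running Spark session):
#   files_df = read_file_stream(spark, "/data/landing/", "device_id STRING, reading DOUBLE")
#   events_df = read_kafka_stream(spark, "localhost:9092", "events")
```

Both helpers return a streaming DataFrame, so the transform and load stages can be written once and applied to either source.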