Spark Streaming is an extension of the core Spark API that enables scalable, fault-tolerant processing of data streams. Spark can stream data from multiple sources, including the following:
- Apache Kafka
- Amazon Kinesis and S3
- TCP
- HDFS
Spark offers two flavors of streaming:
- Spark Structured Streaming, which is built on top of the Spark SQL engine
- Spark Discretized Streams (DStreams), which represent a continuous stream of data as a sequence of small batches (RDDs), one per batch interval
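To make the idea of discretization concrete, here is a minimal sketch in plain Python (not the Spark API) of how a continuous stream of timestamped events might be cut into micro-batches. The function name `discretize`, the sample events, and the one-second interval are assumptions chosen for illustration only.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into micro-batches.

    Hypothetical helper: each event lands in the batch whose index is
    floor(timestamp / batch_interval). Intervals with no events are
    skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the batches in time order
    return [batches[i] for i in sorted(batches)]


# Sample events as (seconds since start, payload); illustrative data only
events = [(0.2, "a"), (0.9, "b"), (1.4, "c"), (2.7, "d"), (2.9, "e")]
micro_batches = discretize(events, batch_interval=1.0)
print(micro_batches)
```

In Spark, each such batch is held as an RDD, and the same transformation is applied to every batch as it arrives.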
In this section, we will explore Spark DStreams and develop an understanding of how they can be used to build streaming solutions.
Let's start with a classic word count problem, where we are trying to count the frequency of each distinct word.
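Before turning to Spark, it may help to see the logic of the problem on a single batch of text. The following is a minimal sketch in plain Python, not Spark code; the function name `count_words` and the sample lines are assumptions for illustration. It splits each line into words and then tallies each distinct word, which is the same split-then-aggregate shape a streaming word count applies to every batch.

```python
from collections import Counter


def count_words(lines):
    """Return the frequency of each distinct word in one batch of lines.

    Hypothetical helper: words are split on whitespace, with no
    normalization of case or punctuation.
    """
    words = (word for line in lines for word in line.split())
    return Counter(words)


# One batch of input lines; illustrative data only
batch = ["to be or not to be", "that is the question"]
word_counts = count_words(batch)
print(word_counts)
```

In a streaming setting, this computation would run once per batch interval on whatever lines arrived during that interval.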