Introducing Spark Streaming
In Chapter 10, Fetching and Persisting Bitcoin Market Data, we used Spark to save transactions in batch mode. Batch mode works well when you need to analyze a complete dataset all at once.
But in some cases, you might need to process data as it enters the system. For example, in a trading system, you might want to analyze all the transactions performed by a broker to detect fraudulent ones. You could run this analysis in batch mode after the market closes, but you would then only be able to act after the fact.
Spark Streaming allows you to consume a streaming source (such as a file, a socket, or a Kafka topic) by dividing the input data into many micro-batches. Each micro-batch is an RDD that can then be processed by the Spark engine. Spark divides the input data using a time window: if you define a time window of 10 seconds, Spark Streaming will create and process a new RDD every 10 seconds.
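To make this concrete, here is a minimal sketch of a StreamingContext configured with a 10-second batch interval, reading lines from a TCP socket. The host, port, and application name are placeholders chosen for illustration, not values from our project:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: at least two threads are needed, one for the
    // receiver and one for processing the micro-batches
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("FraudDetectionSketch") // hypothetical app name

    // Each micro-batch covers a 10-second time window
    val ssc = new StreamingContext(conf, Seconds(10))

    // Consume text lines from a socket (placeholder host/port)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Every 10 seconds, Spark hands us the new micro-batch as an RDD
    lines.foreachRDD { rdd =>
      println(s"Received ${rdd.count()} lines in this micro-batch")
    }

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // block until the streaming job is stopped
  }
}
```

Nothing happens until start() is called; from that point on, the DStream returned by socketTextStream produces one RDD per 10-second window, and foreachRDD is invoked on each of them.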
Going back to our fraud detection system, by...