Before covering online learning with Spark, we will explore the basics of stream processing and introduce the Spark Streaming library.
In addition to the core Spark API, the Spark project includes Spark Streaming, a major library (like MLlib) that focuses on processing data streams in real time.
A data stream is a continuous sequence of records. Common examples include activity stream data from a web or mobile application, time-stamped log data, transactional data, and event streams from sensor or device networks.
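To make the idea of a continuous sequence of records concrete, the following is a minimal sketch in plain Python. It does not use Spark. The `activity_stream` function, the record fields, and the user and action values are all hypothetical, chosen only for illustration. It models an unbounded stream of time-stamped activity events as a generator that a consumer reads record by record.

```python
import itertools
from typing import Dict, Iterator


def activity_stream(start_ts: int = 0) -> Iterator[Dict[str, object]]:
    """Yield an unbounded sequence of time-stamped activity records.

    The schema (timestamp, user, action) is an assumption made for this sketch.
    """
    users = ["alice", "bob", "carol"]   # made-up user IDs
    actions = ["view", "click"]         # made-up event types
    for i in itertools.count():         # counts forever: the stream has no end
        yield {
            "timestamp": start_ts + i,  # one record per simulated second
            "user": users[i % len(users)],
            "action": actions[i % len(actions)],
        }


# The stream never terminates, so a consumer can only ever look at a
# finite prefix of it. Here we take the first five records.
first_five = list(itertools.islice(activity_stream(), 5))
for record in first_five:
    print(record)
```

The point of the generator is that there is never a complete dataset to load: records only become available one at a time, which is what separates a stream from a static file or table.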
The batch processing approach typically involves saving the data stream to an intermediate storage system (for example, HDFS or a database) and running a batch process on the saved data. In order to generate up-to-date results, the batch process must be run periodically (for example, daily, hourly, or even...