As we have seen, Spark Streaming makes it easy to work with data streams in a way that should be familiar to us from working with RDDs. By combining Spark's stream processing primitives with the online learning capabilities of MLlib's SGD-based methods, we can create real-time machine learning models that are updated on new data as it arrives in the stream.
Online learning with Spark Streaming
Streaming regression
Spark provides a built-in streaming machine learning model in the StreamingLinearAlgorithm class. Currently, only a linear regression implementation (StreamingLinearRegressionWithSGD) is available, but future versions will include classification.
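To make the idea of a streaming, SGD-based regression model concrete, the following is a minimal conceptual sketch in plain Python. It does not use Spark and is not how StreamingLinearRegressionWithSGD is implemented; the class OnlineLinearRegression, its update and predict methods, and the simulated_stream helper are all hypothetical names introduced for illustration. It assumes a squared-error loss and a fixed step size, and it stands in for a stream with a generator that yields small batches of (features, label) pairs.

```python
import random


class OnlineLinearRegression:
    """Hypothetical linear model whose weights are refined one batch at a time."""

    def __init__(self, num_features, step_size=0.1):
        self.weights = [0.0] * num_features  # start from a zero weight vector
        self.step_size = step_size

    def predict(self, features):
        # Dot product of the current weights and the feature vector.
        return sum(w * x for w, x in zip(self.weights, features))

    def update(self, batch):
        # Take a single gradient step on the mean squared error of this batch.
        if not batch:
            return  # an empty batch leaves the model unchanged
        gradient = [0.0] * len(self.weights)
        for features, label in batch:
            error = self.predict(features) - label
            for i, x in enumerate(features):
                gradient[i] += error * x
        n = len(batch)
        self.weights = [
            w - self.step_size * g / n for w, g in zip(self.weights, gradient)
        ]


def simulated_stream(num_batches, batch_size, true_weights, seed=42):
    # Stand-in for a data stream: yields batches of noise-free (features, label) pairs.
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            features = [rng.uniform(-1.0, 1.0) for _ in true_weights]
            label = sum(w * x for w, x in zip(true_weights, features))
            batch.append((features, label))
        yield batch


# The model is updated as each batch "arrives", never seeing the full dataset at once.
model = OnlineLinearRegression(num_features=2)
for batch in simulated_stream(num_batches=500, batch_size=20, true_weights=[2.0, -3.0]):
    model.update(batch)
```

Because each batch only nudges the existing weights, the model can be queried with predict at any point between updates, which is the property that makes this style of learning a natural fit for stream processing.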
The streaming regression model exposes two methods:
- trainOn: This takes DStream[LabeledPoint...