Chapter 6. Machine Learning with Spark
So far we have spent a considerable amount of time on the architecture of Spark, the RDD, DataFrame, and Dataset APIs, Spark SQL, and Spark Streaming. All of that laid the foundation for the subject of this chapter: machine learning. Our focus has been on getting data onto the Spark platform, in either batch or streaming fashion, and transforming it into the desired state.
Once you have the data in the platform, what do you do with it? You can use it for reporting, build dashboards on top of it, or let your data scientists analyze it to detect patterns, identify the reasons for specific events, understand customer behavior, group customers into segments to support better decision making, or predict the future.
The power of Spark's MLlib stems from the fact that it lets you run your algorithms over a distributed dataset, which can sometimes be its weakness too...