Chapter 17. Apache Spark MLlib
Chapter 16, Parallelism with Scala and Akka, presented different options for making compute-intensive applications scalable. These solutions are generic and do not address the specific needs of the data scientist. Optimization algorithms, such as the minimization of a loss function, and dynamic programming methods, such as the Viterbi algorithm, require support for caching data and broadcasting model parameters. The Apache Spark framework addresses these shortcomings.
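To make the caching and broadcasting requirement concrete, here is a minimal sketch that assumes Spark 2.x or later running in local mode. The object name `CacheAndBroadcastSketch`, the helper `fitWeight`, the toy dataset (y = 2x), and the learning rate are all hypothetical and chosen only for illustration. The training set is cached once, and the current model parameter is broadcast to the executors at each gradient descent iteration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object CacheAndBroadcastSketch {

  // Fits y = w * x by batch gradient descent on a cached RDD.
  // All names and values here are illustrative assumptions.
  def fitWeight(sc: SparkContext, iterations: Int, learningRate: Double): Double = {
    // Hypothetical training set: (x, y) pairs generated from y = 2x.
    // cache() keeps the partitions in memory across iterations.
    val data = sc.parallelize((1 to 100).map(i => (i.toDouble, 2.0 * i))).cache()
    val n = data.count() // the first action materializes the cache

    var w = 0.0
    for (_ <- 0 until iterations) {
      // Ship the current parameter to the executors as a broadcast variable.
      val wB = sc.broadcast(w)
      // Gradient of the squared error (up to a constant factor), averaged over the data.
      val gradient = data.map { case (x, y) => (wB.value * x - y) * x }.sum() / n
      w -= learningRate * gradient
      wB.unpersist() // release the executor-side copies of the stale parameter
    }
    data.unpersist()
    w
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-broadcast-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(f"estimated weight: ${fitWeight(spark.sparkContext, 50, 1e-4)}%.3f")
    finally spark.stop()
  }
}
```

With these settings the estimate should approach the true slope of 2.0. Without `cache()`, each iteration would recompute the dataset from its lineage, which is the overhead that iterative optimization algorithms need to avoid.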
This final chapter introduces the key concepts behind the Apache Spark framework and its application to large-scale machine learning problems. Readers who want more depth are invited to consult the wealth of books and papers on the topic. The chapter covers four key characteristics of the Apache Spark framework:
- MLlib functionality as illustrated with the K-means algorithm
- Reusable ML pipelines, introduced in Spark 1.2 and made the primary machine learning API in Spark 2.0
- Extensibility of existing Spark functionality using Kullback-Leibler...