Summary
This chapter should be regarded as an invitation to explore the capabilities of the Apache Spark framework on a single host rather than in a large deployment environment.
Beyond introducing the key components of the Apache Spark framework and the concepts of resilient distributed datasets (RDDs), data frames, and datasets, we learned how to leverage RDDs for data clustering with the K-means algorithm, and how to create data frames and reusable ML pipelines for encoding observations and training models.
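As a brief reminder of the technique, the following sketch runs K-means over an RDD using Spark's MLlib API on a single host; the sample observations, application name, and parameter values are hypothetical, not the chapter's exact code.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Single-host Spark context, consistent with the chapter's scope
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("kmeans-sketch"))

    // Hypothetical two-dimensional observations loaded as an RDD
    val observations = sc.parallelize(Seq(
      Vectors.dense(0.10, 0.21), Vectors.dense(0.14, 0.25),
      Vectors.dense(5.02, 5.10), Vectors.dense(5.21, 4.93)
    )).cache()

    // Train K-means with k = 2 clusters and at most 20 iterations
    val model = KMeans.train(observations, k = 2, maxIterations = 20)
    model.clusterCenters.foreach(println)
    sc.stop()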
We also experimented with extending the Spark library with new functionality, such as the Kullback-Leibler divergence, and with leveraging the Spark Streaming library for pseudo-real-time data processing, as applied to data parsing.
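To illustrate the kind of extension the chapter experimented with, here is a minimal sketch of a Kullback-Leibler divergence computed over two discrete distributions held in RDDs; the function name klDivergence and the zip-based pairing are assumptions for illustration, not the chapter's implementation.

    import org.apache.spark.rdd.RDD

    // D(P||Q) = sum_i p(i) * log(p(i)/q(i)) for discrete distributions;
    // zip assumes both RDDs have identical partitioning and element counts
    def klDivergence(p: RDD[Double], q: RDD[Double]): Double =
      p.zip(q)
       .filter { case (pi, qi) => pi > 0.0 && qi > 0.0 } // skip zero bins
       .map { case (pi, qi) => pi * math.log(pi / qi) }
       .sum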