Summary
This chapter provided you with an understanding of Apache Spark. We began with the context of the problems that Hadoop and MapReduce solved and the gaps that remained. Spark addresses the problem of iterative processing for machine learning algorithms and supports real-time querying and the processing of streaming data. We introduced the RDD (Resilient Distributed Dataset), the core construct of Spark. We also learned how to use the Databricks platform and how to launch clusters and notebooks on it. We then moved on to transformations and actions, which form the key execution steps; by chaining transformations and triggering them with an action, it is possible to build a processing pipeline. We covered several examples of transformations, including map, filter, union, and intersection, and learned how to use actions such as count, collect, reduce, first, and take. We then touched on some of the best practices to keep in mind when using Spark...
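To recap how transformations and actions combine into a pipeline, here is a minimal sketch using the RDD API in PySpark. The application name and the sample data are illustrative only; in a Databricks notebook, a SparkContext is already available as sc, so the setup lines can be skipped there.

```python
# Minimal sketch, assuming a local PySpark installation.
# In Databricks, use the provided `sc` instead of creating one.
from pyspark import SparkContext

sc = SparkContext("local[*]", "SummaryExample")  # illustrative app name

# Transformations are lazy: they only define the pipeline's lineage
# and trigger no computation until an action is called.
evens = sc.parallelize(range(10)).filter(lambda x: x % 2 == 0)   # filter
squares = evens.map(lambda x: x * x)                             # map
combined = squares.union(sc.parallelize([1, 9, 25]))             # union

# Actions execute the pipeline and return results to the driver.
print(combined.count())                     # number of elements
print(combined.take(3))                     # first three elements
print(combined.reduce(lambda a, b: a + b))  # sum of all elements

sc.stop()
```

Note that nothing is computed until count, take, or reduce runs; this laziness is what lets Spark plan and optimize the whole pipeline before executing it.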