Chapter 1, A First Taste and What's New in Apache Spark V2, provides an overview of Apache Spark, the functionality that is available within its modules, and how it can be extended. It covers the tools available in the Apache Spark ecosystem outside the standard Apache Spark modules for processing and storage. It also provides tips on performance tuning.
Chapter 2, Apache Spark SQL, shows how to create a schema in Spark SQL, how data can be queried efficiently using the relational API on DataFrames and Datasets, and explores SQL itself.
Chapter 3, The Catalyst Optimizer, explains what a cost-based optimizer in database systems is and why it is necessary. You will master the features and limitations of the Catalyst Optimizer in Apache Spark.
Chapter 4, Project Tungsten, explains why Project Tungsten is essential for Apache Spark and goes on to explain how memory management, cache-aware computation, and code generation are used to speed things up dramatically.
Chapter 5, Apache Spark Streaming, talks about continuous applications using Apache Spark Streaming. You will learn how to incrementally process data and create actionable insights.
Chapter 6, Structured Streaming, talks about Structured Streaming – a new way of defining continuous applications using the DataFrame and Dataset APIs.
Chapter 7, Classical MLlib, introduces you to MLlib, the de facto standard for machine learning when using Apache Spark.
Chapter 8, Apache SparkML, introduces you to the DataFrame-based machine learning library of Apache Spark: the new first-class citizen when it comes to high performance and massively parallel machine learning.
Chapter 9, Apache SystemML, introduces you to Apache SystemML, another machine learning library capable of running on top of Apache Spark and incorporating advanced features such as a cost-based optimizer, hybrid execution plans, and low-level operator rewrites.
Chapter 10, Deep Learning on Apache Spark using H2O and DeepLearning4j, explains that deep learning is currently outperforming one traditional machine learning discipline after another. There are three open source, first-class deep learning libraries running on top of Apache Spark: H2O, DeepLearning4j, and Apache SystemML. You will learn what deep learning is and how to use it on top of Apache Spark with these libraries.
Chapter 11, Apache Spark GraphX, talks about graph processing with Scala using GraphX. You will learn some basic and advanced graph algorithms and how to use GraphX to execute them.
Chapter 12, Apache Spark GraphFrames, discusses graph processing with Scala using GraphFrames. You will learn some basic and advanced graph algorithms, and how GraphFrames differ from GraphX in execution.
Chapter 13, Apache Spark with Jupyter Notebooks on IBM DataScience Experience, introduces a Platform as a Service offering from IBM that is completely based on an open source stack and on open standards. The main advantage is that there is no vendor lock-in: everything you learn here can be installed and used in other clouds, in a local data center, or on your local laptop or PC.
Chapter 14, Apache Spark on Kubernetes, explains that Platform as a Service cloud providers completely manage the operations part of an Apache Spark cluster for you. This is an advantage, but sometimes you need to access individual cluster nodes for debugging and tweaking, without taking on the complexity that maintaining a real cluster on bare-metal or virtual systems entails. Here, Kubernetes might be the best solution. Therefore, in this chapter, we explain what Kubernetes is and how it can be used to set up an Apache Spark cluster.