Getting started with Spark
In this first section, we will learn how to get Spark up and running on our local machine. We will also get an overview of Spark’s architecture and some of its core concepts. This will set the foundation for the more practical data processing sections later in the chapter.
Installing Spark locally
Installing Spark nowadays is as easy as a single pip3 install command:
- After you have installed Java 8, run the following command:
pip3 install pyspark
- This will install PySpark along with its dependencies, including Spark itself. You can test whether the installation was successful by running this command in a terminal:
spark-submit --version
You should see a short output in your terminal showing the Spark logo and the Spark version.
Spark architecture
Spark follows a distributed/cluster architecture, as you can see in the following figure:
Figure 5.1 – Spark cluster architecture
The centerpiece that coordinates...