Processing data with PySpark
Before processing data with PySpark, let's run one of the samples to show how Spark works. Then, we will skip the boilerplate in later examples and focus on data processing. The Jupyter notebook for the Pi Estimation example from the Spark website at http://spark.apache.org/examples.html is shown in the following screenshot:
The example from the website will not run without some modifications. In the following points, I will walk through the cells:
- The first cell imports `findspark` and runs the `init()` method. This was explained in the preceding section as the preferred method to include PySpark in Jupyter notebooks. The code is as follows:

  ```python
  import findspark
  findspark.init()
  ```
- The next cell imports the `pyspark` library and `SparkSession`. It then creates the session by passing the URL of the head node of the Spark cluster. You can get the URL from the Spark web UI...