Creating a SparkSession for PySpark
First introduced in Chapter 1, PySpark is the Python API for Apache Spark. It lets us use Python to access Spark functionality such as data manipulation, batch and real-time processing, and machine learning.
However, before we can ingest or process any data with PySpark, we must initialize a SparkSession. This recipe shows how to create a SparkSession using PySpark and explains why it is important.
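To preview where we are headed, the following is a minimal sketch of creating a SparkSession with the builder pattern; the application name (sparksession-recipe) and the local[*] master URL are illustrative choices, not requirements:

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; getOrCreate() returns the existing
# session if one is already active in this process
spark = (
    SparkSession.builder
    .master("local[*]")                 # run Spark locally on all available cores
    .appName("sparksession-recipe")     # illustrative application name
    .getOrCreate()
)

print(spark.version)  # confirm the session is up and report its Spark version

Calling getOrCreate() rather than constructing a session directly is the idiomatic approach, since only one active SparkSession is expected per application.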
Getting ready
We first need to ensure we have the correct PySpark version. We installed PySpark in Chapter 1, but it is always good to confirm which version we are running. Run the following command:
$ pyspark --version
You should see the following output:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version <your Spark version>
      /_/
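If the pyspark launcher is not on your PATH, you can also check the version from a Python interpreter using the standard pyspark.__version__ attribute:

import pyspark

# Prints the installed PySpark version string
print(pyspark.__version__)

Both checks should report the same version; if they differ, the pyspark launcher and your Python environment are pointing at different installations.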