Getting Started with Spark DataFrames
To get started with Spark DataFrames, we first need to create something called a SparkContext. The SparkContext configures the internal services of a Spark application under the hood and establishes the connection to the Spark execution environment, through which our commands are submitted.
Note
We will be using Spark version 2.1.1, running on Python 3.7.1. Spark and Python are installed on a MacBook Pro, running macOS Mojave version 10.14.3, with a 2.7 GHz Intel Core i5 processor and 8 GB 1867 MHz DDR3 RAM.
The following code snippet creates a SparkContext:
from pyspark import SparkContext
sc = SparkContext()
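If you need to control settings such as the master URL or the application name, the context can also be configured explicitly. The following is a minimal sketch, shown as an alternative to the bare constructor above; the master URL "local[*]" and the application name are illustrative values, not requirements:
from pyspark import SparkConf, SparkContext
# "local[*]" runs Spark locally using all available cores;
# the application name is arbitrary and appears in the Spark UI
conf = SparkConf().setMaster("local[*]").setAppName("DataFrameTutorial")
sc = SparkContext(conf=conf)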
Note
If you are working in the PySpark shell, you should skip this step, as the shell automatically creates the sc (SparkContext) variable when it starts. However, be sure to create the sc variable yourself when writing a PySpark script or working in a Jupyter notebook, or your code will throw an error.
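If the same code may run both in the shell (where sc already exists) and as a standalone script, one defensive pattern, sketched below, is to use SparkContext.getOrCreate(), which returns the active context if one is running and creates one otherwise:
from pyspark import SparkContext
# Reuses the active SparkContext if one exists, which avoids the
# "Cannot run multiple SparkContexts at once" error in the shell
sc = SparkContext.getOrCreate()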
We also need to create an SQLContext before we can start working with DataFrames. The SQLContext is Spark's entry point for structured data processing: it sits on top of the SparkContext and lets us create DataFrames and run SQL queries against them.
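The following snippet is a minimal sketch of creating one, assuming the sc variable from the previous step:
from pyspark.sql import SQLContext
# Wraps the existing SparkContext and exposes DataFrame and SQL functionality
sqlContext = SQLContext(sc)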