The first thing to do to access Spark is to create a SparkContext. The SparkContext initializes the Spark runtime and sets up any access to Hadoop that may be needed, if you are using Hadoop as well.
The original entry point for SQL work was the SQLContext, but it has since been deprecated in favor of the more general SparkSession; the lower-level SparkContext used in this example remains the entry point for working with RDDs.
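For example, a SparkContext can be configured explicitly through a SparkConf object before it is created; in this minimal sketch, the application name and local master URL are illustrative values, and the SparkSession lines show the equivalent modern entry point:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession

    # Illustrative configuration: the app name and local master URL are examples only
    conf = SparkConf().setAppName("LineLengthExample").setMaster("local[*]")
    sc = SparkContext.getOrCreate(conf)

    # The equivalent SparkSession entry point, which wraps a SparkContext
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    sc = spark.sparkContext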
We can use a simple example that reads through a text file and totals the lengths of its lines:
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    lines = sc.textFile("B05238_04 Spark Total Line Lengths.ipynb")
    lineLengths = lines.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)
    print(totalLength)
In this example:
- We obtain a SparkContext
- We use the context to read in a file (the Jupyter file for this example)
- We use a map function (in the Hadoop MapReduce style) to transform each line into its length, and then reduce to add those lengths together, giving the total number of characters in the file, as sketched below
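To see the same map/reduce pattern in isolation, here is a minimal sketch that uses a small in-memory RDD built with parallelize; the sample strings are illustrative and not taken from the notebook file:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Three illustrative strings standing in for lines read from a file
    sample = sc.parallelize(["spark", "map", "reduce"])

    # map transforms each string into its length: 5, 3, 6
    lengths = sample.map(lambda s: len(s))

    # reduce adds the lengths pairwise, giving 14
    total = lengths.reduce(lambda a, b: a + b)
    print(total)

Note that map is a lazy transformation: nothing is actually computed until reduce, an action, is called, at which point Spark distributes the work across the available cores or cluster nodes.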