Using Spark to analyze data
The first thing to do in order to access Spark is to create a SparkContext. The SparkContext initializes all of Spark and sets up any access to Hadoop that may be needed, if you are using that as well. Earlier releases also relied on a separate SQLContext for SQL work, but that has since been deprecated in favor of SparkSession, which wraps a SparkContext and is more open-ended.
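As a quick illustration, here is a minimal sketch of creating a SparkSession and retrieving the underlying SparkContext from it; the application name used here is only a placeholder, not something required by the example that follows:
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; the app name is an illustrative placeholder
spark = SparkSession.builder.appName("example").getOrCreate()

# The lower-level SparkContext is available from the session
sc = spark.sparkContext
print(sc.version)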
As a simple example, we can read through a text file and total up the lengths of its lines, as follows:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
lines = sc.textFile("B05238_04 Spark Total Line Lengths.ipynb")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
print(totalLength)
In this example:
- We obtain a SparkContext
- With the context, we read in a file (the Jupyter Notebook file for this example); textFile splits it into lines
- We use the map transformation to convert each line into its length
- We use the reduce action to add up the lengths of all the lines (a small sketch of these two steps follows this list)
- We display our result
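To make the map and reduce steps concrete, here is a minimal, self-contained sketch that applies the same pattern to a tiny in-memory dataset instead of a file; the sample strings are made up purely for illustration:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# A tiny stand-in for the lines read from a text file
sample = sc.parallelize(["spark", "map", "reduce"])

# map: each string becomes its length -> [5, 3, 6]
lengths = sample.map(lambda s: len(s))

# reduce: pairwise addition collapses the lengths to a single total -> 14
print(lengths.reduce(lambda a, b: a + b))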
Under Jupyter, the notebook for this example looks as follows: