Ratings histogram walk-through
Remember the RatingsHistogram code that we ran for your first Spark program? Understanding concepts is all well and good, but nothing beats looking at a real example, so let's go back to the one we started off with in this book. We'll break it down and see exactly what it's doing under the hood, and how it uses RDDs to get the results for the ratings data.
Understanding the code
The first couple of lines are just boilerplate. One thing you'll see in every Python Spark script is the import statement that brings in SparkConf and SparkContext from the pyspark library that Spark includes. You will, at a minimum, need those two objects:
from pyspark import SparkConf, SparkContext
import collections
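The collections import hasn't been explained at this point, so here is a rough, Spark-free sketch of the kind of counting a ratings histogram involves, just to make the end goal concrete. Everything in it is an assumption for illustration: the sample lines are made up, and the whitespace-separated layout (user ID, movie ID, rating, timestamp) is a guess at the file format rather than something taken from the real data. It is not the script itself, which does this work through Spark.

```python
import collections

# Hypothetical sample lines; the column layout
# (user ID, movie ID, rating, timestamp) is an assumption.
sample_lines = [
    "1 101 3 1000000001",
    "2 102 3 1000000002",
    "3 103 1 1000000003",
    "4 104 2 1000000004",
    "5 105 1 1000000005",
    "6 106 4 1000000006",
]

# Pull the rating (third field) out of each line.
ratings = [line.split()[2] for line in sample_lines]

# Count how many times each rating value occurs.
counts = collections.Counter(ratings)

# Sort by rating value so the histogram prints in a predictable order.
histogram = collections.OrderedDict(sorted(counts.items()))

for rating, count in histogram.items():
    print(rating, count)
```

With plain Python, all of this happens on one machine in one process. The point of the walk-through that follows is to see how the same idea is expressed with RDDs.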
SparkContext, as we talked about earlier, is the fundamental starting point that the Spark framework gives you...