Resilient distributed datasets
Spark expresses all computations as a sequence of transformations and actions on distributed collections, called Resilient Distributed Datasets (RDDs). Let's explore how RDDs work with the Spark shell. Navigate to the examples directory and open a Spark shell as follows:
$ spark-shell
scala>
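Before walking through the email example, here is a minimal sketch of the transformation/action distinction mentioned above (the parallelize call and the doubling step are illustrative assumptions, not part of this chapter's dataset):

scala> val numbers = sc.parallelize(1 to 4)    // create an RDD from a local collection
scala> val doubled = numbers.map { _ * 2 }     // transformation: lazily defines a new RDD
scala> doubled.collect()                       // action: triggers the computation and returns the result
res0: Array[Int] = Array(2, 4, 6, 8)

Transformations such as map only describe how to build a new RDD; nothing runs on the cluster until an action such as collect asks for a result.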
Let's start by loading an email into an RDD:
scala> val email = sc.textFile("ham/9-463msg1.txt")
email: rdd.RDD[String] = MapPartitionsRDD[1] at textFile
email is an RDD, with each element corresponding to a line in the input file. Notice how we created the RDD by calling the textFile method on an object called sc:
scala> sc
spark.SparkContext = org.apache.spark.SparkContext@459bf87c
sc is a SparkContext instance, an object representing the entry point to the Spark cluster (for now, just our local machine). When we start a Spark shell, a context is created and bound to the variable sc automatically.
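Outside the shell, no sc is created for us: in a standalone application we build the context ourselves. The following is a minimal sketch, where the application name and the local[*] master URL are assumptions chosen for local testing:

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample {
  def main(args: Array[String]): Unit = {
    // Name the application and run against a local "cluster" using all available cores.
    val conf = new SparkConf()
      .setAppName("rdd-example")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    // The context is the entry point for creating RDDs, exactly as in the shell.
    val email = sc.textFile("ham/9-463msg1.txt")
    println(email.count())

    sc.stop()  // release resources when done
  }
}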
Let's split the email into words using flatMap:
scala> val words...
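The command is truncated here; one common way to split each line into words with flatMap looks roughly like the following, where splitting on whitespace is an assumption about how the text is tokenized:

scala> val words = email.flatMap { _.split("\\s+") }   // each line yields zero or more words
scala> words.take(5)                                    // action: peek at the first few elements

Because flatMap is a transformation, the split is not actually performed until an action such as take or count requests a result.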