Learning Spark core concepts
Let's understand the core concepts of Spark in this section. The main abstraction Spark provides is the Resilient Distributed Dataset (RDD), so we will look at what an RDD is and at the operations on RDDs that provide in-memory performance and fault tolerance. First, though, let's learn the ways to work with Spark.
Ways to work with Spark
There are two ways to work with Spark: the Spark Shell and Spark Applications.
Spark Shell
The Spark Shell is an interactive REPL (read-eval-print loop) for data exploration using Scala, Python, or R:
# Enter the Scala shell (type :q or press Ctrl + D to exit):
[cloudera@quickstart spark-2.0.0-bin-hadoop2.7]$ bin/spark-shell

# Enter the Python shell (press Ctrl + D to exit):
[cloudera@quickstart spark-2.0.0-bin-hadoop2.7]$ bin/pyspark

# Enter the R shell (R must be installed first; press Ctrl + D to exit):
[cloudera@quickstart spark-2.0.0-bin-hadoop2.7]$ bin/sparkR
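Once inside the shell, a SparkContext is already available as the predefined variable sc (and, since Spark 2.0, a SparkSession as spark), so you can start experimenting right away. The following is a minimal sketch in the Scala shell; the data is purely illustrative:

scala> val nums = sc.parallelize(1 to 100)   // distribute a local collection as an RDD
scala> nums.filter(_ % 2 == 0).count()       // a transformation (filter) followed by an action (count)
res0: Long = 50

Note that nothing is actually computed until the action count() is called, a point we will return to when discussing operations on RDDs.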
For a complete list of spark-shell options, use the following command:
[cloudera@quickstart spark-2.0.0-bin-hadoop2.7]$ bin/spark-shell --help
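Two options you will use often are --master, which sets the cluster manager to connect to, and --name, which sets the application name shown in the Spark UI. For example, the following invocation (the application name here is illustrative) starts the shell on the local machine with two worker threads:

[cloudera@quickstart spark-2.0.0-bin-hadoop2.7]$ bin/spark-shell --master local[2] --name "exploration-shell"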