Initializing SparkContext
This recipe shows how to initialize the SparkContext object as part of a Spark application. SparkContext is the object that allows us to create the base RDDs. Every Spark application must contain this object in order to interact with Spark. It is also used to initialize StreamingContext, SQLContext, and HiveContext.
Getting ready
To step through this recipe, you will need a running Spark cluster in any one of the modes: local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Also install Hadoop (optional), Scala, and Java. Please download the data from the following location:
https://github.com/ChitturiPadma/datasets/blob/master/stocks.txt
How to do it…
Let's see how to initialize SparkContext:
- Invoke spark-shell:
$SPARK_HOME/bin/spark-shell --master <master type>
Spark context available as sc.
- Invoke PySpark:
$SPARK_HOME/bin/pyspark --master <master type>
SparkContext available as sc
- Invoke SparkR:
$SPARK_HOME/bin/sparkR --master <master type>
Spark context is available as sc
- Now, let's initialize SparkContext in standalone applications written in Scala, Java, and Python:
Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkContextExample {
  def main(args: Array[String]) {
    val stocksPath = "hdfs://namenode:9000/stocks.txt"
    // Configure the application name and the master URL of the standalone cluster
    val conf = new SparkConf().setAppName("Counting Lines").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)
    // Read the file as an RDD with a minimum of 2 partitions and count its lines
    val data = sc.textFile(stocksPath, 2)
    val totalLines = data.count()
    println("Total number of Lines: %s".format(totalLines))
  }
}
Java:
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;

public class SparkContextExample {
  public static void main(String[] args) {
    String stocks = "hdfs://namenode:9000/stocks.txt";
    // Configure the application name and the master URL of the standalone cluster
    SparkConf conf = new SparkConf().setAppName("Counting Lines").setMaster("spark://master:7077");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Read the file as an RDD of lines and count them
    JavaRDD<String> logData = sc.textFile(stocks);
    long totalLines = logData.count();
    System.out.println("Total number of Lines " + totalLines);
  }
}
Python:
from pyspark import SparkContext

stocks = "hdfs://namenode:9000/stocks.txt"
# Pass the master URI and the application name directly to the constructor
sc = SparkContext("<master URI>", "ApplicationName")
data = sc.textFile(stocks)
totalLines = data.count()
print("Total Lines are: %i" % (totalLines))
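Note that, unlike the shell sessions shown earlier, these standalone applications have to be packaged (for example, with sbt or Maven) and launched with spark-submit. As a sketch, assuming the Scala example above is packaged into a JAR (the JAR name here is hypothetical):
$SPARK_HOME/bin/spark-submit --class SparkContextExample --master spark://master:7077 sparkcontext-example.jar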
How it works…
In the preceding code snippets, new SparkContext(conf), new JavaSparkContext(conf), and SparkContext("<master URI>", "ApplicationName") initialize SparkContext in three different languages: Scala, Java, and Python. SparkContext is the starting point for Spark functionality. It represents the connection to a Spark cluster, and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
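For instance, the following Scala snippet is a minimal sketch, assuming a spark-shell session where sc is already available and the stocks.txt path used earlier in this recipe; the symbol set and counter name are purely illustrative:

// Minimal sketch: run in spark-shell, where sc is already defined.
val badLines = sc.accumulator(0, "badLines")              // shared counter updated by the tasks
val symbols  = sc.broadcast(Set("AAPL", "GOOG", "MSFT"))  // read-only lookup cached on each worker

val data = sc.textFile("hdfs://namenode:9000/stocks.txt")
data.foreach { line =>
  // increment the accumulator for lines that mention none of the broadcast symbols
  if (!symbols.value.exists(s => line.contains(s))) badLines += 1
}
println("Lines without a known symbol: " + badLines.value)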
There's more…
SparkContext is created on the driver and holds the connection to the cluster; RDDs are initially created through it. It is not serializable, so it cannot be shipped to the workers, and only one SparkContext can be active per application. For Streaming applications and the Spark SQL module, StreamingContext and SQLContext are created on top of SparkContext.
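For example, against the Spark 1.6 APIs referenced in the See also section, these contexts can be created from an existing sc as follows (a minimal sketch, assuming sc is already initialized and, for HiveContext, that the Hive dependencies are on the classpath):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// sc is an existing SparkContext (for example, the one created in this recipe)
val sqlContext  = new SQLContext(sc)                     // entry point for Spark SQL
val hiveContext = new HiveContext(sc)                    // Spark SQL with Hive support
val ssc         = new StreamingContext(sc, Seconds(10))  // streaming with 10-second batches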
See also
To understand more about the SparkContext object and its methods, please refer to this documentation page: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.SparkContext.