Initializing SparkContext
This recipe shows how to initialize the SparkContext object, which is part of every Spark application. SparkContext lets us create the base RDDs, and every Spark application must create it in order to interact with Spark. It is also used to initialize StreamingContext, SQLContext, and HiveContext.
Let's see how to initialize SparkContext:
- Invoke spark-shell:
$SPARK_HOME/bin/spark-shell --master <master type>
Spark context available as sc.
- Invoke PySpark:
$SPARK_HOME/bin/pyspark --master <master type>
SparkContext available as sc
- Invoke SparkR:
$SPARK_HOME/bin/sparkR --master <master type>
Spark context is available as sc
- Now, let's initialize SparkContext in standalone applications written in Scala, Java, and Python:
Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkContextExample {
  def main(args: Array[String]) {
    val stocksPath = "hdfs://namenode:9000/stocks.txt"
    // Set the application name and the master URL for the cluster
    val conf = new SparkConf().setAppName("Counting Lines").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)
    // Read the file as an RDD with two partitions and count its lines
    val data = sc.textFile(stocksPath, 2)
    val totalLines = data.count()
    println("Total number of Lines: %s".format(totalLines))
  }
}
Java:
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;

public class SparkContextExample {
  public static void main(String[] args) {
    String stocks = "hdfs://namenode:9000/stocks.txt";
    // Set the application name and the master URL for the cluster
    SparkConf conf = new SparkConf().setAppName("Counting Lines").setMaster("spark://master:7077");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(stocks);
    // Count the lines of the RDD (not the path string)
    long totalLines = logData.count();
    System.out.println("Total number of Lines " + totalLines);
  }
}
Python:
from pyspark import SparkContext

stocks = "hdfs://namenode:9000/stocks.txt"
# Create the SparkContext with a master URL and an application name
sc = SparkContext("<master URI>", "ApplicationName")
data = sc.textFile(stocks)
totalLines = data.count()
print("Total Lines are: %i" % (totalLines))
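If you want to run these standalone examples outside the interactive shells, they can be launched with spark-submit. The commands below are only a sketch: the jar name sparkcontext-example.jar, the script name sparkcontext_example.py, and the packaging step itself are assumptions, and the master URL is already set in the code above, so it is not repeated on the command line:
$SPARK_HOME/bin/spark-submit --class SparkContextExample sparkcontext-example.jar
$SPARK_HOME/bin/spark-submit sparkcontext_example.py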
In the preceding code snippets, new SparkContext(conf), new JavaSparkContext(conf), and SparkContext("<master URI>", "ApplicationName") initialize SparkContext in three different languages: Scala, Java, and Python. SparkContext is the starting point for Spark functionality. It represents the connection to a Spark cluster, and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
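As a quick illustration of that last point, here is a minimal Scala sketch, assuming sc is the SparkContext created earlier; the variable names and values are made up for the example:
// Assumes sc is an existing SparkContext
val numbers = sc.parallelize(1 to 100, 4)          // an RDD with four partitions
val multiplesOfTen = sc.accumulator(0)             // an accumulator the workers add to
val sectors = sc.broadcast(Map("IBM" -> "Tech"))   // a read-only value shipped to every worker
numbers.foreach { n =>
  if (n % 10 == 0) multiplesOfTen += 1
}
println(multiplesOfTen.value)                      // read the accumulator back on the driver
println(sectors.value("IBM"))                      // read the broadcast value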
SparkContext is created on the driver and holds the application's connection to the cluster. RDDs are initially created through SparkContext. Because SparkContext is not serializable, it cannot be shipped to the workers, and only one SparkContext can be active per application. For streaming applications and the Spark SQL module, StreamingContext and SQLContext are created on top of SparkContext.
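To make that last point concrete, here is a minimal Scala sketch of building those contexts on top of an existing sc; the 10-second batch interval is an arbitrary choice for illustration:
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Both contexts wrap the existing SparkContext instead of creating a new one
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(10))
// HiveContext (from the spark-hive module) is created the same way: new HiveContext(sc)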