SparkSession versus SparkContext
You will have noticed that we are using both SparkSession and SparkContext, and this is not an error. Let's revisit the annals of Spark history for some perspective. It is important to understand where we came from, as you will keep hearing about these connection objects for some time to come.
Prior to Spark 2.0.0, the three main connection objects were SparkContext, SQLContext, and HiveContext. The SparkContext object was the connection to a Spark execution environment and created RDDs and other objects such as accumulators and broadcast variables, SQLContext worked with Spark SQL on top of SparkContext, and HiveContext interacted with Hive stores.
Spark 2.0.0 introduced Datasets/DataFrames as the main distributed data abstraction interface and the SparkSession object as the entry point to a Spark execution environment. Appropriately, the SparkSession object is found in the namespace org.apache.spark.sql.SparkSession (Scala) or pyspark.sql.SparkSession (Python). A few points to note are as follows:
- In Scala...