The SparkSession: your gateway to structured data processing
The SparkSession is the starting point for working with columnar data in Apache Spark. It replaces the SQLContext used in previous versions of Apache Spark. It is created from the Spark context and provides the means to load and save data files of different types using DataFrames and Datasets, and to manipulate columnar data with SQL, among other things. It can be used for the following tasks (a short example follows the list):
- Executing SQL via the `sql` method
- Registering user-defined functions via the `udf` method
- Caching
- Creating DataFrames
- Creating Datasets
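As a brief sketch, the following snippet shows how a session might be obtained and then used for several of these tasks. The application name, the local master setting, and the sample data are illustrative assumptions of ours, not taken from the text:

```scala
import org.apache.spark.sql.SparkSession

// Obtain (or reuse) a SparkSession; appName and master are placeholder settings
val spark = SparkSession.builder()
  .appName("SparkSessionExample")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Creating a DataFrame from an in-memory sequence (placeholder data)
val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

// Executing SQL via the sql method, against a temporary view
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 26").show()

// Registering a user-defined function via the udf method
spark.udf.register("doubleAge", (age: Int) => age * 2)
spark.sql("SELECT name, doubleAge(age) AS doubled FROM people").show()

// Caching a table so that repeated queries avoid recomputation
spark.catalog.cacheTable("people")
```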
Note
The examples in this chapter are written in Scala as we prefer the language, but you can develop in Python, R, and Java as well. As stated previously, the SparkSession is created from the Spark context.
Using the SparkSession allows you to implicitly convert RDDs into DataFrames or Datasets. For instance, you can convert an RDD into a DataFrame or a Dataset by calling the `toDF` or `toDS` methods:
```scala
import spark.implicits._
...
```
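The import above is shown truncated in the text; as a self-contained sketch, assuming a case class and sample records of our own, the conversions might look like this:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder settings for a local test
val spark = SparkSession.builder()
  .appName("RDDConversions")
  .master("local[*]")
  .getOrCreate()

// The implicits bring toDF and toDS into scope for RDDs
import spark.implicits._

// A case class supplies the schema and the encoder (our own example type)
case class Person(name: String, age: Int)

val rdd = spark.sparkContext.parallelize(
  Seq(Person("Alice", 30), Person("Bob", 25)))

val df = rdd.toDF() // DataFrame: untyped rows with named columns
val ds = rdd.toDS() // Dataset[Person]: strongly typed API

df.printSchema()
ds.filter(_.age > 26).show()
```

Note that `toDF` yields untyped rows addressed by column name, whereas `toDS` preserves the `Person` type so that filters and maps are checked at compile time.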