Spark SQL
Executing SQL queries to meet basic business needs is common practice, and almost every business does it through some kind of database. Spark SQL supports the execution of SQL queries written in either basic SQL syntax or HiveQL, and it can also read data from an existing Hive installation. Beyond these plain SQL operations, Spark SQL addresses some harder problems: designing complex logic through relational queries alone was cumbersome, and at times nearly impossible. Spark SQL was therefore designed to integrate relational processing with functional programming, so that complex logic can be implemented, optimized, and scaled on a distributed computing setup. There are three ways to interact with Spark SQL: SQL, the DataFrame API, and the Dataset API. At the time of writing this book, the Dataset API is an experimental layer added in Spark 1.6, so we will limit our discussion to DataFrames.
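To make the first two interaction styles concrete, here is a minimal sketch using the Spark 1.6-era SQLContext API. The people.json file, its schema (name, age), and the table name people are hypothetical, introduced only for illustration; the same filter is expressed once as plain SQL and once through the DataFrame API:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object SparkSQLExample extends App {
  val sc = new SparkContext("local[*]", "SparkSQLExample")
  val sqlContext = new SQLContext(sc)

  // Load a JSON file into a DataFrame (hypothetical path and schema)
  val people = sqlContext.read.json("people.json")

  // Register the DataFrame so it can be queried with plain SQL
  people.registerTempTable("people")

  // 1. Interact via SQL
  val adultsSQL = sqlContext.sql(
    "SELECT name, age FROM people WHERE age >= 18")

  // 2. The equivalent query via the DataFrame API
  val adultsDF = people.filter(people("age") >= 18).select("name", "age")

  adultsSQL.show()
  adultsDF.show()

  sc.stop()
}
```

Both versions compile down to the same logical plan, which is the point of the integration described above: whether the logic arrives as a SQL string or as functional-style DataFrame calls, it is optimized and distributed by the same engine.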
Spark SQL exposes DataFrames as a...