Chapter 4. Big Data Analytics with Spark SQL, DataFrames, and Datasets
As per the Spark Summit presentation by Matei Zaharia, the creator of Apache Spark (http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote), Spark SQL and DataFrames are the most widely used components of the Spark ecosystem. This indicates that Spark SQL is one of the key components companies rely on for Big Data Analytics.
Spark users have three different APIs for interacting with distributed collections of data (a brief sketch contrasting them follows this list):
- The RDD API allows users to work with objects of their choice and express transformations as lambda functions
- The DataFrame API provides high-level relational operations and an optimized runtime, at the expense of type safety
- The Dataset API, which combines the benefits of the RDD and DataFrame APIs
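The following is a minimal sketch, in Scala, of how the same filter might be expressed with each of the three APIs. The Person case class, the sample data, and the application name are illustrative assumptions, not examples from this book.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only to contrast the three APIs
case class Person(name: String, age: Int)

object ApiComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ApiComparison")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // RDD API: arbitrary objects, transformations expressed as lambda functions
    val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 34), Person("Bob", 19)))
    val adultsRdd = rdd.filter(p => p.age >= 21)

    // DataFrame API: relational operations on untyped rows, optimized by Catalyst
    val df = rdd.toDF()
    val adultsDf = df.filter($"age" >= 21)

    // Dataset API: relational operations with compile-time type safety
    val ds = rdd.toDS()
    val adultsDs = ds.filter(_.age >= 21)

    adultsRdd.collect().foreach(println)
    adultsDf.show()
    adultsDs.show()

    spark.stop()
  }
}
```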
We learned how to use the RDD API in Chapter 3, Deep Dive into Apache Spark. In this chapter, let's explore the concepts of Spark SQL in depth, including the Data Sources API, the DataFrame API, the...