Summary
In this chapter, we discussed Spark SQL as a one-stop solution for processing large datasets in memory, combining SQL-like queries with complex procedural algorithms to produce results in seconds or minutes rather than hours.
We started with the various aspects of Spark SQL, including its architecture and components, then covered the end-to-end process of writing Spark SQL jobs in Scala, along with the different methodologies for converting Spark RDDs into DataFrames (see the sketch below). In the middle of the chapter, we worked through examples of Spark SQL over different data formats, such as Hive and Parquet, together with important aspects such as schema evolution and schema merging. Finally, we discussed how to tune the performance of Spark SQL code and queries.
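To recap the core mechanics in one place, here is a minimal sketch, assuming a local SparkSession, a hypothetical Person case class, and an illustrative Parquet path. It shows the reflection-based toDF() approach for converting an RDD into a DataFrame, a SQL query over the result, and the per-read mergeSchema option used for Parquet schema merging:

import org.apache.spark.sql.SparkSession

// Hypothetical case class used only for illustration.
case class Person(name: String, age: Int)

object SparkSqlRecap {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlRecap")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._ // enables the toDF() conversion

    // Reflection-based conversion: an RDD of case-class instances
    // becomes a DataFrame whose schema is inferred from Person's fields.
    val peopleRDD = spark.sparkContext.parallelize(
      Seq(Person("Alice", 30), Person("Bob", 25)))
    val peopleDF = peopleRDD.toDF()
    peopleDF.createOrReplaceTempView("people")

    // Mixing SQL-like queries with programmatic processing.
    spark.sql("SELECT name FROM people WHERE age > 26").show()

    // Parquet schema merging is requested per read, since it is
    // relatively expensive and therefore off by default.
    // The path below is illustrative only:
    // val merged = spark.read.option("mergeSchema", "true")
    //   .parquet("/path/to/parquet/table")

    spark.stop()
  }
}

Note that toDF() is only one of the conversion methodologies covered in the chapter; the programmatic alternative builds a StructType schema explicitly and applies it with createDataFrame, which is useful when column structure is known only at runtime.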
In the next chapter, we will discuss capturing, processing, and analyzing streaming data using Spark Streaming.