Chapter 4. Spark SQL
We've had a roller coaster ride so far. In the last chapter, we looked at performing ETL with Spark and, most importantly, at loading and saving data from and to various data sources. We've looked at structured data streams and NoSQL databases, and throughout we tried to keep our attention on using RDDs to work with such data sources. We touched briefly on the DataFrame and DataSet APIs but refrained from going into much detail, as we wanted to cover them fully in this chapter.
If you have a database background and are still coming to terms with the RDD API, this is the chapter you'll love the most, as it explains how you can use SQL to exploit the capabilities of the Spark framework.
In this chapter we will be covering the following key topics:
- DataFrame API
- DataSet API
- Catalyst Optimizer
- Spark Session
- Manipulating Spark DataFrames
- Working with Hive, Parquet files, and other databases
Let...