In this chapter, we discussed the origin of DataFrames and how Spark SQL provides a SQL interface on top of them. With DataFrames, execution times can be many times shorter than with the original RDD-based computations, and pairing that performance with a simple SQL-like interface makes them all the more useful. We also looked at the various APIs for creating and manipulating DataFrames, and dug deeper into the more sophisticated aggregation features, including groupBy, Window, rollup, and cube. Finally, we looked at the concept of joining datasets and the various types of joins possible, such as inner, outer, and cross joins.
In the next chapter, Chapter 9, Stream Me Up, Scotty - Spark Streaming, we will explore the exciting world of real-time data processing and analytics.