DataFrame API
Spark DataFrames were inspired by Pandas DataFrames in Python. A DataFrame is essentially a collection of rows organized into named columns. You can think of it as a table: the column names form the header, and the data is arranged beneath it row by row. This table-like format has long been central to computation in tools such as relational databases and comma-separated files.
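Since Spark's DataFrame API was modeled after Pandas, the row-and-column idea is easiest to see in a small Pandas sketch (the column names and values here are just illustrative):

```python
import pandas as pd

# Column names act as the table header; each row of data
# lines up under those headers.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],  # hypothetical sample data
        "age": [34, 29, 41],
    }
)

print(df)

assert list(df.columns) == ["name", "age"]
assert len(df) == 3
```

A Spark DataFrame exposes the same tabular view, but the rows are distributed across a cluster rather than held in local memory.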
Spark’s DataFrame API is built on top of RDDs. The data is still stored in RDDs underneath, but DataFrames add an abstraction layer that hides their complexity. Like RDDs, DataFrames are immutable and lazily evaluated. As you may remember from previous chapters, lazy evaluation improves performance by running computations only when their results are needed. It also lets Spark apply a large number of optimizations to DataFrame operations by planning how best to compute them. The computations...
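Lazy evaluation can be illustrated without a Spark cluster. The sketch below is a plain-Python analogy, not Spark code: generators defer work the same way Spark defers transformations until an action forces them to run.

```python
# Track which input rows have actually been processed.
executed = []

def numbers():
    """Source of rows; records when work really happens."""
    for n in range(5):
        executed.append(n)
        yield n

def double(rows):
    """A 'transformation': builds a new lazy pipeline, does no work yet."""
    return (r * 2 for r in rows)

pipeline = double(numbers())   # like chaining transformations on a DataFrame
assert executed == []          # nothing has been computed so far

result = list(pipeline)        # like calling an action (e.g. collect)
assert result == [0, 2, 4, 6, 8]
assert executed == [0, 1, 2, 3, 4]
```

In Spark, this deferral is what gives the engine room to optimize: because no transformation runs immediately, the whole chain can be planned and rewritten before any data is touched.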