What is the DataFrame API?
Before looking at what the DataFrame API is, we should review what an RDD is and identify what could be improved in the RDD interface. The RDD has been the user-facing API in Apache Spark since its inception and, as discussed earlier, it can represent unstructured data, is compile-time safe, has dependencies, is evaluated lazily, and represents a distributed collection of data across a Spark cluster. RDDs can have partitions, and those partitions can carry locality information, which helps the Spark scheduler run the computation on the machines where the data is already available and so reduces costly network overhead.
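As a reminder of what "partitioned" and "evaluated lazily" mean in practice, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the `ToyRDD` class and its methods are hypothetical names invented for illustration, and everything runs in a single process rather than across a cluster.

```python
from typing import Callable, List, Optional, Tuple


class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus a recorded pipeline."""

    def __init__(self, partitions: List[list],
                 pipeline: Optional[List[Tuple[str, Callable]]] = None):
        self.partitions = partitions
        # Recorded transformations, loosely analogous to an RDD's dependencies.
        self.pipeline = pipeline or []

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformation: only records the step, nothing is computed yet.
        return ToyRDD(self.partitions, self.pipeline + [("map", fn)])

    def filter(self, fn: Callable) -> "ToyRDD":
        # Transformation: only records the step, nothing is computed yet.
        return ToyRDD(self.partitions, self.pipeline + [("filter", fn)])

    def _run_partition(self, partition: list) -> list:
        rows = iter(partition)
        for kind, fn in self.pipeline:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def collect(self) -> list:
        # Action: triggers the recorded pipeline, one partition at a time.
        result = []
        for partition in self.partitions:
            result.extend(self._run_partition(partition))
        return result


calls = []  # records every value the mapped function actually sees


def double(x: int) -> int:
    calls.append(x)
    return x * 2


rdd = ToyRDD([[1, 2], [3, 4]])  # two partitions
big_doubles = rdd.map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # snapshot taken before any action runs
collected = big_doubles.collect()
```

In this sketch `calls_before_action` stays empty because `map` and `filter` only record work; `double` is invoked only once `collect()` runs. In a real cluster each partition would be processed by a task scheduled, where possible, on a machine that already holds that partition's data.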
However, from a programming perspective, the computation itself is less transparent, because Spark doesn't know what you are doing (for example, whether you are performing a join, a filter, and so on). RDDs express the how of a solution better than the what of a solution. The data itself is opaque to the optimizer, which means Spark gets an object either in Scala...
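The difference between an opaque computation and a transparent one can be illustrated with a small sketch in plain Python. This is a toy model under stated assumptions, not Spark code: `GreaterThan` and `describe` are hypothetical names, and `describe` only stands in for the kind of inspection an optimizer could perform.

```python
from dataclasses import dataclass
from typing import Any, Dict

Row = Dict[str, Any]


@dataclass(frozen=True)
class GreaterThan:
    """Hypothetical declarative predicate meaning: column > value."""
    column: str
    value: Any

    def evaluate(self, row: Row) -> bool:
        return row[self.column] > self.value


def describe(predicate: Any) -> str:
    """Report what an engine could learn about a filter without running it."""
    if isinstance(predicate, GreaterThan):
        return f"filter on column '{predicate.column}' (> {predicate.value!r})"
    # An arbitrary function can only be called, not inspected.
    return "opaque function: columns used are unknown"


rows = [{"name": "a", "age": 25}, {"name": "b", "age": 40}]

# The "how": an arbitrary function applied to each row.
opaque_filter = lambda row: row["age"] > 30

# The "what": a description of the condition itself.
declared_filter = GreaterThan("age", 30)

opaque_result = [r for r in rows if opaque_filter(r)]
declared_result = [r for r in rows if declared_filter.evaluate(r)]

opaque_summary = describe(opaque_filter)
declared_summary = describe(declared_filter)
```

Both filters select the same rows, but only the declarative one tells the engine which column is involved and what the condition is. That extra knowledge is what makes it possible, in principle, to reorder or push down operations before any data is touched.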