Why DataFrames?
Apart from massive, scalable computing capability, big data applications also need a few more features: support for a relational system for interactive data analysis (simple, SQL-style queries), the ability to work with heterogeneous data sources, and support for different storage formats along with different processing techniques.
Though Spark provided a functional programming API (RDDs) for manipulating distributed collections of data, operations often ended up working with tuples whose fields are accessed by position (_1, _2, ...). Code written against tuples was complicated and messy, and was slow at times. So a standardized layer was needed, with the following characteristics:
Named columns with a schema (a higher-level abstraction than tuples), so that manipulating and tracking them is easy (see the sketch after this list)
Functionality to consolidate data from various data sources such as Hive, Parquet, SQL Server, PostgreSQL, and JSON, as well as Spark's native RDDs, and to unify them into a common format
Ability to take advantage of built-in schemas in special file formats such as Avro, CSV, JSON, and so on
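To make the contrast concrete, here is a minimal Scala sketch that compares positional tuple access on an RDD with named-column access on a DataFrame. It assumes a local SparkSession created purely for illustration; the data, column names, and the commented-out "people.json" path are placeholders, not part of the original text.

import org.apache.spark.sql.SparkSession

object WhyDataFrames {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session, just for demonstration
    val spark = SparkSession.builder()
      .appName("why-dataframes")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // RDD style: fields are addressed by position (_1, _2), which is
    // hard to read and easy to get wrong as the tuple grows
    val rdd = spark.sparkContext.parallelize(Seq(("Alice", 34), ("Bob", 29)))
    val namesRdd = rdd.filter(_._2 >= 30).map(_._1)

    // DataFrame style: named columns carry a schema, so the same query
    // reads naturally and can be optimized by the engine
    val df = rdd.toDF("name", "age")
    val namesDf = df.filter($"age" >= 30).select("name")
    namesDf.show()

    // Formats with built-in or inferable schemas (JSON, Parquet, Avro, ...)
    // can be loaded directly into the same DataFrame abstraction:
    // val people = spark.read.json("people.json")  // placeholder path
    // people.printSchema()

    spark.stop()
  }
}

The same DataFrame code works regardless of whether the rows came from an RDD, a JSON file, or a Hive table, which is exactly the unification the list above calls for.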