Why DataFrames?
Apart from massive, scalable computing capability, big data applications also need several other features: support for a relational system for interactive data analysis (simple SQL style), support for heterogeneous data sources, and support for different storage formats along with different processing techniques.
Though Spark provided a functional programming API to manipulate distributed collections of data, those operations often ended up working on tuples (_1, _2, ...). Code written against tuples was complicated and messy, and was slow at times. So, a standardized layer was needed, with the following characteristics:
- Named columns with a schema (a higher-level abstraction than tuples) so that manipulating and tracking them would be easy (see the sketch after this list)
- Functionality to consolidate data from various data sources such as Hive, Parquet, SQL Server, PostgreSQL, JSON, and also Spark's native RDDs, and unify them into a common format
- Ability to take advantage of built-in schemas in special file formats such as Avro,...
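As a rough illustration of the first characteristic, the following sketch contrasts filtering an RDD of tuples (positional `_1`, `_2` access) with the same operation on a DataFrame that carries named columns. The sample data, the column names `name` and `age`, and the application name are hypothetical, and the sketch assumes Spark 2.0+ where `SparkSession` is available:

```scala
import org.apache.spark.sql.SparkSession

object TuplesVsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TuplesVsDataFrame")   // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // RDD of tuples: fields are reachable only positionally as _1, _2, ...
    val rdd = spark.sparkContext.parallelize(Seq(("Alice", 34), ("Bob", 45)))
    val namesRdd = rdd.filter(_._2 > 40).map(_._1)

    // The same data as a DataFrame: named columns with a schema
    val df = rdd.toDF("name", "age")
    val namesDf = df.filter($"age" > 40).select("name")

    namesDf.show()
    spark.stop()
  }
}
```

With the DataFrame, the intent of the query reads directly from the column names, and the same named-column abstraction applies regardless of which source the data was loaded from.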