Chapter 3. Introduction to DataFrames
Solving any real-world big data analytics problem requires access to an efficient and scalable computing system. However, that computing power is of little use if the target users cannot reach it in a way that is easy and familiar to them. Interactive data analysis gets easier when datasets can be represented as named columns, which plain RDDs do not provide. This need for a schema-based approach to representing data in a standardized way was the inspiration behind DataFrames.
The previous chapter outlined some design aspects of Spark. We learnt how Spark enables distributed data processing on distributed collections of data (RDDs) through in-memory computation. That chapter covered most of the points that make Spark a fast, efficient, and scalable computing platform. In this chapter, we will see how Spark introduced the DataFrame API to make data scientists feel at home as they carry out their usual...