Summary
In this chapter, we introduced Spark DataFrames and the advantages they offer over RDDs. We explored different ways of creating Spark DataFrames, converting them to regular pandas DataFrames, and writing their contents to output files.
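By way of recap, the following minimal sketch illustrates those steps; the application name and output path are illustrative, and a local Spark installation is assumed:

```python
from pyspark.sql import SparkSession

# Assumes a local Spark installation; the app name is illustrative.
spark = SparkSession.builder.appName("summary-recap").getOrCreate()

# Create a Spark DataFrame from an in-memory list of rows.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    schema=["name", "age"],
)

# Convert to a pandas DataFrame (collects all rows to the driver,
# so this is only safe for small results).
pdf = df.toPandas()

# Write the Spark DataFrame out as CSV; the path is hypothetical.
df.write.mode("overwrite").csv("output/people_csv", header=True)
```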
We performed hands-on data exploration in PySpark by computing basic statistics and metrics for Spark DataFrames. We then manipulated the data with operations such as filtering, selection, and aggregation, and plotted the results to generate insightful visualizations.
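A short sketch of those exploration and manipulation steps, continuing with the `df` from the example above (the plotting call assumes matplotlib is available, since pandas delegates to it):

```python
from pyspark.sql import functions as F

# Basic statistics (count, mean, stddev, min, max) for a numeric column.
df.describe("age").show()

# Filtering and selection: keep rows with age >= 30, project two columns.
adults = df.filter(F.col("age") >= 30).select("name", "age")

# Aggregation: average age over the filtered rows.
adults.agg(F.avg("age").alias("avg_age")).show()

# Plotting typically happens after converting a small result to pandas.
adults.toPandas().plot.bar(x="name", y="age")
```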
Furthermore, we consolidated our understanding of these concepts through hands-on exercises and activities.
In the next chapter, we will explore how to handle missing values and compute correlation between variables in PySpark.