Introduction
In the previous chapter, we learned the basic concepts of Spark DataFrames and saw how to leverage them for big data analysis.
In this chapter, we will go a step further and learn how to handle missing values and perform correlation analysis with Spark DataFrames. Both concepts will help us with data preparation for machine learning and with exploratory data analysis.
We will briefly cover these concepts to provide some context, but our focus is on their implementation with Spark DataFrames. For the exercises in this chapter, we will use the same Iris dataset that we used in the previous chapter. The original Iris dataset has no missing values, so we have randomly removed two entries from the Sepallength column and one entry from the Petallength column. The result is a dataset with missing values, and we will learn how to handle them using PySpark.
We will also look at the correlation between the variables in the Iris dataset...