Handling Missing Values in Spark DataFrames
Missing value handling is one of the more complex areas of data science. A variety of techniques are used to handle missing values, depending on the type of missing data and the business use case at hand.
These methods range from simple logic-based methods to advanced statistical methods such as regression and KNN. However, irrespective of the method used to tackle the missing values, we end up performing one of the following two operations on the records that contain them:
Removing the records with missing values from the data
Imputing the missing value entries with some constant value
In this section, we will explore how to perform both of these operations with PySpark DataFrames.
Exercise 40: Removing Records with Missing Values from a DataFrame
In this exercise, we will remove the records that contain missing values from a PySpark DataFrame. Let's perform the following steps:
To remove the missing values from a particular column, use the following...