Chapter 05: Missing Value Handling and Correlation Analysis in Spark
Activity 12: Missing Value Handling and Correlation Analysis with PySpark DataFrames
Import the required libraries and modules in the Jupyter notebook, as illustrated here:
import findspark
findspark.init()
import pyspark
import random
Set up the SparkContext with the help of the following command in the Jupyter notebook:
sc = pyspark.SparkContext(appName = "chapter5")
Similarly, set up the SQLContext in the notebook:
from pyspark.sql import SQLContext
sqlc = SQLContext(sc)
Now, read the CSV data into a Spark DataFrame using the following command:
df = sqlc.read.format('com.databricks.spark.csv').options(header = 'true', inferschema = 'true').load('iris.csv')
df.show(5)
The output is as follows:
Fill in the missing values in the Sepallength column with the column's mean.
First, calculate the mean of the Sepallength column using the following command:
from pyspark.sql.functions import mean
# first()[0] returns the mean as a plain Python float rather than a one-element pandas Series
avg_sl = df.select(mean('Sepallength')).first()[0]
Now, impute the missing values in the Sepallength column with the column's mean, as illustrated here:
y = df
y = y.na.fill(float(avg_sl), ['Sepallength'])
y.describe().show(1)
The output is as follows:
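To make the imputation step concrete, here is a minimal plain-Python sketch of what filling with the column mean does. It does not use Spark; the helper name impute_with_mean and the sample values are made up for illustration, and None stands in for a missing entry.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to average, so return the values unchanged.
        return list(values)
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical sample standing in for a column with gaps.
sepal_lengths = [5.0, None, 4.0, 6.0, None]
imputed = impute_with_mean(sepal_lengths)
print(imputed)
```

Because every gap receives the mean of the observed values, the column mean is the same before and after imputation, although the spread of the column shrinks.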
Compute the correlation matrix for the dataset. Make sure to import the required modules, as shown here:
from pyspark.mllib.stat import Statistics
import pandas as pd
Now, fill any missing values that remain in the DataFrame with a placeholder value of 1 before computing the correlation:
z = y.fillna(1)
Next, remove the string column (Species) from the PySpark DataFrame and convert the remaining rows into an RDD of feature tuples, as illustrated here:
a = z.drop('Species')
features = a.rdd.map(lambda row: row[0:])
Now, compute the correlation matrix in Spark:
correlation_matrix = Statistics.corr(features, method="pearson")
Next, convert the correlation matrix into a pandas DataFrame using the following command:
correlation_df = pd.DataFrame(correlation_matrix)
correlation_df.index, correlation_df.columns = a.columns, a.columns
correlation_df
The output is as follows:
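For readers who want to see what each entry of the matrix means, the following plain-Python sketch computes the Pearson coefficient for every pair of columns. It is an illustration of the formula only, not of how Spark implements Statistics.corr; the function names and the toy columns u, v, and w are assumptions made for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


def correlation_matrix(columns):
    """Nested dict holding the pairwise Pearson coefficients."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


# Hypothetical toy columns: v rises with u, w falls as u rises.
toy = {
    "u": [1.0, 2.0, 3.0, 4.0],
    "v": [2.0, 4.0, 6.0, 8.0],
    "w": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal carry information about variable pairs.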
Plot the variable pairs that show a strong positive correlation and fit a regression line to each of them.
First, load the data from the Spark DataFrame into a pandas DataFrame:
import pandas as pd
dat = y.toPandas()
type(dat)
The output is as follows:
pandas.core.frame.DataFrame
Next, load the required modules and plot Sepallength against Petallength using the following commands:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.lmplot(x = "Sepallength", y = "Petallength", data = dat)
plt.show()
The output is as follows:
Plot the graph so that x equals Sepallength, and y equals Petalwidth:
import seaborn as sns
sns.lmplot(x = "Sepallength", y = "Petalwidth", data = dat)
plt.show()
The output is as follows:
Plot the graph so that x equals Petallength and y equals Petalwidth:
sns.lmplot(x = "Petallength", y = "Petalwidth", data = dat)
plt.show()
The output is as follows:
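The straight line that lmplot draws through each scatter plot is a linear regression fit. As a rough sketch of the underlying calculation, the plain-Python function below computes an ordinary least-squares slope and intercept. The name fit_line and the sample points are hypothetical, and the sketch leaves out the confidence band that seaborn shades around the line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x and co-variation of x with y around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie close to a rising straight line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept)
```

A positive slope corresponds to the positive correlation seen in the matrix, and the residuals of a least-squares fit with an intercept always sum to zero, up to rounding.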