Chapter 07: Reproducibility in Big Data Analysis
Activity 14: Test normality of data attributes (columns) and carry out Gaussian normalization of non-normally distributed attributes
Import the required libraries and packages in the Jupyter notebook:
import numpy as np
import pandas as pd
import seaborn as sns
import time
import re
import os
import matplotlib.pyplot as plt
sns.set(style="ticks")
Now, import the libraries required for preprocessing:
import sklearn as sk
from scipy import stats
from sklearn import preprocessing
Set the working directory using the following command:
os.chdir("/Users/svk/Desktop/packt_exercises")
Now, import the dataset into a pandas DataFrame:
df = pd.read_csv('bank.csv', sep=';')
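To confirm that the file was parsed correctly with the ';' separator, a quick inspection of the DataFrame can help (a minimal sketch; the exact columns shown depend on the version of bank.csv used):

# quick sanity checks on the loaded DataFrame
print(df.shape)    # number of rows and columns
print(df.dtypes)   # data type of each attribute
df.head()          # first five records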
Identify the target variable in the data:
DV = 'y'
df[DV] = df[DV].astype('category')
df[DV] = df[DV].cat.codes
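After encoding, a quick check of the class balance of the target can be informative (a minimal sketch, not part of the original activity):

# check the class balance of the encoded target variable
print(df[DV].value_counts())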
Generate training and testing data using the following command:
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]
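Note that np.random.rand produces a different mask on every run, so the split is not reproducible as written. In keeping with the reproducibility theme of this chapter, a seed can be fixed beforehand (a minimal sketch; the seed value 42 is an arbitrary choice):

# fix the random seed so that the train/test split is reproducible across runs
np.random.seed(42)
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]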
Create the Y and X data, as illustrated here:
# selecting the target variable (dependent variable) as y
y_train = train[DV]
Drop the DV (y) column using the drop command to obtain the X data:
train = train.drop(columns=[DV])
train.head()
The output is as follows:
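The same steps can be applied to the test split so that it stays aligned with the training data (a minimal sketch mirroring the commands above):

# build the test-set target and predictors in the same way as for training
y_test = test[DV]
test = test.drop(columns=[DV])
test.head()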
Segment the data into numeric and categorical attributes, and perform the distribution transformation on the numeric data:
numeric_df = train._get_numeric_data()
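Note that _get_numeric_data() is a private pandas method; the public select_dtypes API gives an equivalent selection here and is less likely to change between pandas versions (a minimal alternative sketch):

# equivalent selection of the numeric columns using the public pandas API
numeric_df = train.select_dtypes(include=[np.number])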
Next, perform preprocessing on the numeric data.
Now, create a loop to identify the columns with a non-normal distribution using the following command (converting to NumPy arrays for more efficient computation):
numeric_df_array = np.array(numeric_df)
loop_c = -1
col_for_normalization = list()

for column in numeric_df_array.T:
    loop_c += 1
    x = column
    k2, p = stats.normaltest(x)
    alpha = 0.001
    print("p = {:g}".format(p))

    # rules for printing the normality output
    if p < alpha:
        test_result = "non_normal_distr"
        col_for_normalization.append((loop_c))  # applicable if yeo-johnson is used
        #if min(x) > 0:  # applicable if box-cox is used
            #col_for_normalization.append((loop_c))  # applicable if box-cox is used
        print("The null hypothesis can be rejected: non-normal distribution")
    else:
        test_result = "normal_distr"
        print("The null hypothesis cannot be rejected: normal distribution")
The output is as follows:
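stats.normaltest implements the D'Agostino-Pearson test, which combines skew and kurtosis; a small p-value indicates a departure from normality. The following self-contained illustration on synthetic data (not part of the activity) shows how the p-values behave:

# illustrative check: a normal sample versus a strongly skewed sample
rng = np.random.RandomState(0)
normal_sample = rng.normal(size=1000)
skewed_sample = rng.exponential(size=1000)
print(stats.normaltest(normal_sample))   # large p-value: normality not rejected
print(stats.normaltest(skewed_sample))   # tiny p-value: normality rejected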
Create a PowerTransformer-based transformation (Yeo-Johnson):
pt = preprocessing.PowerTransformer(method='yeo-johnson', standardize=True, copy=True)
Note
Box-Cox can handle only positive values, which is why the Yeo-Johnson method is used here.
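If every selected column were strictly positive, the Box-Cox method could be used instead. A minimal sketch of that variant (not used in this activity):

# Box-Cox variant: valid only when all input values are strictly positive
pt_boxcox = preprocessing.PowerTransformer(method='box-cox', standardize=True, copy=True)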
Before applying the power transformation model to the data, select the columns that need to be normalized:
columns_to_normalize = numeric_df[numeric_df.columns[col_for_normalization]]
names_col = list(columns_to_normalize)
Create a density plot to check the normality:
columns_to_normalize.plot.kde(bw_method=3)
The output is as follows:
Now, transform the columns to a normal distribution using the following command:
normalized_columns = pt.fit_transform(columns_to_normalize)
normalized_columns = pd.DataFrame(normalized_columns, columns=names_col)
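Because the fitted transformer stores the learned lambda parameters, the original scale can be recovered later if required; a minimal sketch using scikit-learn's inverse_transform:

# recover the original (pre-transformation) values from the normalized columns
recovered = pt.inverse_transform(normalized_columns)
recovered = pd.DataFrame(recovered, columns=names_col)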
Again, create a density plot to check the normality:
normalized_columns.plot.kde(bw_method=3)
The output is as follows:
Use a loop to identify the columns with non-normal distribution on the transformed data:
numeric_df_array = np.array(normalized_columns)
loop_c = -1

for column in numeric_df_array.T:
    loop_c += 1
    x = column
    k2, p = stats.normaltest(x)
    alpha = 0.001
    print("p = {:g}".format(p))

    # rules for printing the normality output
    if p < alpha:
        test_result = "non_normal_distr"
        print("The null hypothesis can be rejected: non-normal distribution")
    else:
        test_result = "normal_distr"
        print("The null hypothesis cannot be rejected: normal distribution")
The output is as follows:
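Since the same normality-check loop is used twice in this activity, it can be wrapped in a small helper function for reuse (a minimal sketch; the name check_normality is an assumption, not part of the original code):

def check_normality(data, alpha=0.001):
    """Return the positional indices of columns that fail the normality test."""
    non_normal_cols = []                       # hypothetical helper, for illustration only
    for i, column in enumerate(np.array(data).T):
        k2, p = stats.normaltest(column)
        print("p = {:g}".format(p))
        if p < alpha:
            non_normal_cols.append(i)
            print("The null hypothesis can be rejected: non-normal distribution")
        else:
            print("The null hypothesis cannot be rejected: normal distribution")
    return non_normal_cols

# for example: col_for_normalization = check_normality(numeric_df)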
Bind the normalized and non-normalized columns. First, select the columns that were not normalized:
# columns_to_notnormalize points to numeric_df, so the in-place drop below also removes
# the normalized columns from numeric_df itself
columns_to_notnormalize = numeric_df
columns_to_notnormalize.drop(columns_to_notnormalize.columns[col_for_normalization],
                             axis=1, inplace=True)
Use the following command to bind both the non-normalized and normalized columns:
numeric_df_normalized = pd.concat([columns_to_notnormalize.reset_index(drop=True), normalized_columns], axis=1)
numeric_df_normalized
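The categorical (non-numeric) columns that were set aside earlier can then be re-attached, for example after one-hot encoding; a minimal sketch (the get_dummies step is an assumption about how the categorical attributes might be handled and is not part of the original activity):

# one-hot encode the categorical attributes and join them with the normalized numeric data
categorical_df = train.select_dtypes(exclude=[np.number])
categorical_dummies = pd.get_dummies(categorical_df)
X_train = pd.concat([numeric_df_normalized,
                     categorical_dummies.reset_index(drop=True)], axis=1)
X_train.head()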