Chapter 07: Reproducibility in Big Data Analysis
Activity 14: Test normality of data attributes (columns) and carry out Gaussian normalization of non-normally distributed attributes
Import the required libraries and packages in the Jupyter notebook:
import numpy as np
import pandas as pd
import seaborn as sns
import time
import re
import os
import matplotlib.pyplot as plt
sns.set(style="ticks")
Now, import the libraries required for preprocessing:
import sklearn as sk
from scipy import stats
from sklearn import preprocessing
Set the working directory using the following command:
os.chdir("/Users/svk/Desktop/packt_exercises")
Now, import the dataset into a pandas DataFrame:
df = pd.read_csv('bank.csv', sep=';')
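To confirm that the file was parsed correctly with the ';' separator, a quick inspection of the DataFrame can help (a minimal sketch; the exact columns shown depend on the version of bank.csv used):

# quick sanity checks on the loaded DataFrame
print(df.shape)    # number of rows and columns
print(df.dtypes)   # data type of each attribute
df.head()          # first five records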
Identify the target variable in the data:
DV = 'y'
df[DV] = df[DV].astype('category')
df[DV] = df[DV].cat.codes
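After encoding, a quick check of the class balance of the target can be informative (a minimal sketch, not part of the original activity):

# check the class balance of the encoded target variable
print(df[DV].value_counts())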
Generate training and testing data using the following command:
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]
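Note that np.random.rand produces a different mask on every run, so the split is not reproducible as written. In keeping with the reproducibility theme of this chapter, a seed can be fixed beforehand (a minimal sketch; the seed value 42 is an arbitrary choice):

# fix the random seed so that the train/test split is reproducible across runs
np.random.seed(42)
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]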
Create the Y and X data, as illustrated here:
# selecting the target variable (dependent variable) as y
y_train = train[DV]
Drop the DV (y) column using the drop command to obtain the X data:
train = train.drop(columns=[DV])
train.head()
The output is as follows:
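The same steps can be applied to the test split so that it stays aligned with the training data (a minimal sketch mirroring the commands above):

# build the test-set target and predictors in the same way as for training
y_test = test[DV]
test = test.drop(columns=[DV])
test.head()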
Segment the data into numeric and categorical attributes, and perform the distribution transformation on the numeric data:
numeric_df = train._get_numeric_data()
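Note that _get_numeric_data() is a private pandas method; the public select_dtypes API gives an equivalent selection here and is less likely to change between pandas versions (a minimal alternative sketch):

# equivalent selection of the numeric columns using the public pandas API
numeric_df = train.select_dtypes(include=[np.number])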
Next, perform preprocessing on the numeric data.
Now, create a loop to identify the columns with a non-normal distribution using the following command (converting to NumPy arrays for more efficient computation):
numeric_df_array = np.array(numeric_df)
loop_c = -1
col_for_normalization = list()

for column in numeric_df_array.T:
    loop_c += 1
    x = column
    k2, p = stats.normaltest(x)
    alpha = 0.001
    print("p = {:g}".format(p))

    # rules for printing the normality output
    if p < alpha:
        test_result = "non_normal_distr"
        col_for_normalization.append((loop_c))  # applicable if yeo-johnson is used
        #if min(x) > 0:  # applicable if box-cox is used
            #col_for_normalization.append((loop_c))  # applicable if box-cox is used
        print("The null hypothesis can be rejected: non-normal distribution")
    else:
        test_result = "normal_distr"
        print("The null hypothesis cannot be rejected: normal distribution")
The output is as follows:
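stats.normaltest implements the D'Agostino-Pearson test, which combines skew and kurtosis; a small p-value indicates a departure from normality. The following self-contained illustration on synthetic data (not part of the activity) shows how the p-values behave:

# illustrative check: a normal sample versus a strongly skewed sample
rng = np.random.RandomState(0)
normal_sample = rng.normal(size=1000)
skewed_sample = rng.exponential(size=1000)
print(stats.normaltest(normal_sample))   # large p-value: normality not rejected
print(stats.normaltest(skewed_sample))   # tiny p-value: normality rejected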
Create a PowerTransformer-based transformation (Yeo-Johnson):
pt = preprocessing.PowerTransformer(method='yeo-johnson', standardize=True, copy=True)
Note
Box-Cox can handle only positive values, which is why the Yeo-Johnson method is used here.
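If every selected column were strictly positive, the Box-Cox method could be used instead. A minimal sketch of that variant (not used in this activity):

# Box-Cox variant: valid only when all input values are strictly positive
pt_boxcox = preprocessing.PowerTransformer(method='box-cox', standardize=True, copy=True)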
Before applying the power transformation model to the data, select the columns that need to be normalized:
columns_to_normalize = numeric_df[numeric_df.columns[col_for_normalization]]
names_col = list(columns_to_normalize)
Create a density plot to check the normality:
columns_to_normalize.plot.kde(bw_method=3)
The output is as follows:
Now, transform the columns to a normal distribution using the following command:
normalized_columns = pt.fit_transform(columns_to_normalize)
normalized_columns = pd.DataFrame(normalized_columns, columns=names_col)
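Because the fitted transformer stores the learned lambda parameters, the original scale can be recovered later if required; a minimal sketch using scikit-learn's inverse_transform:

# recover the original (pre-transformation) values from the normalized columns
recovered = pt.inverse_transform(normalized_columns)
recovered = pd.DataFrame(recovered, columns=names_col)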
Again, create a density plot to check the normality:
normalized_columns.plot.kde(bw_method=3)
The output is as follows:
Use a loop to identify the columns with non-normal distribution on the transformed data:
numeric_df_array = np.array(normalized_columns)
loop_c = -1

for column in numeric_df_array.T:
    loop_c += 1
    x = column
    k2, p = stats.normaltest(x)
    alpha = 0.001
    print("p = {:g}".format(p))

    # rules for printing the normality output
    if p < alpha:
        test_result = "non_normal_distr"
        print("The null hypothesis can be rejected: non-normal distribution")
    else:
        test_result = "normal_distr"
        print("The null hypothesis cannot be rejected: normal distribution")
The output is as follows:
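Since the same normality-check loop is used twice in this activity, it can be wrapped in a small helper function for reuse (a minimal sketch; the name check_normality is an assumption, not part of the original code):

def check_normality(data, alpha=0.001):
    """Return the positional indices of columns that fail the normality test."""
    non_normal_cols = []                       # hypothetical helper, for illustration only
    for i, column in enumerate(np.array(data).T):
        k2, p = stats.normaltest(column)
        print("p = {:g}".format(p))
        if p < alpha:
            non_normal_cols.append(i)
            print("The null hypothesis can be rejected: non-normal distribution")
        else:
            print("The null hypothesis cannot be rejected: normal distribution")
    return non_normal_cols

# for example: col_for_normalization = check_normality(numeric_df)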
Bind the normalized and non-normalized columns. First, select the columns that were not normalized:
# columns_to_notnormalize points to numeric_df, so the in-place drop below also removes
# the normalized columns from numeric_df itself
columns_to_notnormalize = numeric_df
columns_to_notnormalize.drop(columns_to_notnormalize.columns[col_for_normalization],
                             axis=1, inplace=True)
Use the following command to bind both the non-normalized and normalized columns:
numeric_df_normalized = pd.concat([columns_to_notnormalize.reset_index(drop=True), normalized_columns], axis=1)
numeric_df_normalized
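The categorical (non-numeric) columns that were set aside earlier can then be re-attached, for example after one-hot encoding; a minimal sketch (the get_dummies step is an assumption about how the categorical attributes might be handled and is not part of the original activity):

# one-hot encode the categorical attributes and join them with the normalized numeric data
categorical_df = train.select_dtypes(exclude=[np.number])
categorical_dummies = pd.get_dummies(categorical_df)
X_train = pd.concat([numeric_df_normalized,
                     categorical_dummies.reset_index(drop=True)], axis=1)
X_train.head()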