You're reading from Data Science with Python Combine Python with machine learning principles to discover hidden patterns in raw data

Product type Paperback

Published in Jul 2019

Publisher Packt

ISBN-13 9781838552862

Length 426 pages

Edition 1st Edition

Languages

Python

Tools

Combine

Concepts

Data Science

Authors (3):

Rohan Chopra

Mohamed Noordeen Alaudeen

Aaron England

Table of Contents (10) Chapters

About the Book

1. Introduction to Data Science and Data Pre-Processing

2. Data Visualization

3. Introduction to Machine Learning via Scikit-Learn

4. Dimensionality Reduction and Unsupervised Learning

5. Mastering Structured Data

6. Decoding Images

7. Processing Human Language

8. Tips and Tricks of the Trade

1. Appendix

Chapter 1: Introduction to Data Science and Data Preprocessing

Activity 1: Pre-Processing Using the Bank Marketing Subscription Dataset

Solution

Let's perform various pre-processing tasks on the Bank Marketing Subscription dataset. We'll also be splitting the dataset into training and testing data. Follow these steps to complete this activity:

Open a Jupyter notebook and add a new cell to import the pandas library and load the dataset into a pandas dataframe. To do so, you first need to import the library, and then use the pd.read_csv() function, as shown here:
import pandas as pd
# Raw-file URL; the github.com ".../blob/..." page address serves HTML, not CSV
Link = 'https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/Banking_Marketing.csv'
#reading the data into the dataframe df
df = pd.read_csv(Link, header=0)
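pd.read_csv() needs the address of the file itself, whereas a GitHub page address containing /blob/ returns an HTML page. The following is a minimal sketch of converting one form into the other, assuming the usual GitHub URL layout. Note that to_raw_github_url is a hypothetical helper written for illustration; it is not part of pandas:

```python
from urllib.parse import urlparse


def to_raw_github_url(url: str) -> str:
    """Convert a github.com '/blob/' page URL into its raw-file URL.

    Hypothetical helper for illustration; URLs that do not match the
    expected layout are returned unchanged.
    """
    parts = urlparse(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url
    owner, repo, _, *rest = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


page_url = (
    "https://github.com/TrainingByPackt/Data-Science-with-Python"
    "/blob/master/Chapter01/Data/Banking_Marketing.csv"
)
raw_url = to_raw_github_url(page_url)
print(raw_url)
```

The resulting address can be passed straight to pd.read_csv().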
To find the number of rows and columns in the dataset, add the following code:
#Finding number of rows and columns
print("Number of rows and columns : ",df.shape)
The preceding code generates the following output:
Figure 1.60: Number of rows and columns in the dataset
To print the list of all columns, add the following code:
#Printing all the columns
print(list(df.columns))
The preceding code generates the following output:
Figure 1.61: List of columns present in the dataset
To get an overview of the basic statistics of each numeric column, such as the count, mean, median (the 50% row), standard deviation, minimum value, and maximum value, add the following code:
#Basic Statistics of each column
df.describe().transpose()
The preceding code generates the following output:
Figure 1.62: Basic statistics of each column
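To see what describe().transpose() returns without downloading the dataset, here is a small sketch on a toy dataframe. The column names and values are made up for illustration and are not taken from the Bank Marketing data:

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "age": [30, 40, 50, 60],
    "duration": [100, 200, 300, 400],
    "job": ["admin", "technician", "admin", "services"],
})

# describe() summarises only the numeric columns by default;
# transpose() puts one column per row, which is easier to scan.
stats = toy.describe().transpose()
print(stats[["count", "mean", "50%", "min", "max"]])
```

The job column does not appear in the result because it is not numeric; the median is reported in the 50% column.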
To print the basic information of each column, add the following code:
#Basic Information of each column
print(df.info())
The preceding code generates the following output:
Figure 1.63: Basic information of each column
In the preceding figure, you can see the non-null count and the type of each column. A column whose non-null count is lower than the total number of rows contains missing values.
Now let's check for missing values and the type of each feature. Add the following code to do this:
#finding the data types of each column and checking for null
null_ = df.isna().any()
dtypes = df.dtypes
sum_na_ = df.isna().sum()
info = pd.concat([null_,sum_na_,dtypes],axis = 1,keys = ['isNullExist','NullSum','type'])
info
Have a look at the output for this in the following figure:
Figure 1.64: Information of each column stating the number of null values and the data types
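The same summary table can be tried out on a toy dataframe with a few deliberately missing entries. This is only a sketch; the data is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries (values are made up).
toy = pd.DataFrame({
    "age": [30.0, np.nan, 50.0, 60.0],
    "job": ["admin", "technician", None, None],
    "y": [0, 1, 0, 1],
})

# One row per column: does it contain nulls, how many, and its dtype.
summary = pd.concat(
    [toy.isna().any(), toy.isna().sum(), toy.dtypes],
    axis=1,
    keys=["isNullExist", "NullSum", "type"],
)
print(summary)
```

Each of the three inputs is a Series indexed by column name, so concatenating them along axis=1 lines them up side by side.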
Since we have loaded the dataset into the df object, we can now remove the rows that contain null values. To do so, add the following code:
#removing rows with null values
df = df.dropna()
#Total number of nulls in each column (all should be 0)
print(df.isna().sum())
Have a look at the output for this in the following figure:
Figure 1.65: Features of dataset with no null values
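Keep in mind that dropna() removes every row that has at least one missing value, so the number of rows can shrink noticeably. Here is a small sketch on made-up data that shows the effect:

```python
import numpy as np
import pandas as pd

# Toy data (values are made up): rows 1 and 2 each have one missing value.
toy = pd.DataFrame({
    "age": [30.0, np.nan, 50.0, 60.0],
    "job": ["admin", "technician", None, "services"],
})

# Drop every row that has at least one missing value.
cleaned = toy.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")

# The surviving rows keep their original labels; renumbering is optional.
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Comparing df.shape before and after the call is a quick way to see how many rows were lost.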
Now we check the frequency distribution of the education column in the dataset. Use the value_counts() function to implement this:
df.education.value_counts()
Have a look at the output for this in the following figure:
Figure 1.66: Frequency distribution of the education column
In the preceding figure, we can see that the education column of the dataset has many categories. We need to reduce the categories for better modeling. To check the various categories in the education column, we use the unique() function. Type the following code to implement this:
df.education.unique()
The output is as follows:
Figure 1.67: Various categories of the education column
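Now let's group the basic.4y, basic.9y, and basic.6y categories together and call them Basic. To do this, we can use the replace function from pandas and assign the result back to the column, which avoids relying on an in-place change to a column selected through attribute access:
df['education'] = df['education'].replace({"basic.9y":"Basic","basic.6y":"Basic","basic.4y":"Basic"})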
To check the list of categories after grouping, add the following code:
df.education.unique()
Figure 1.68: Various categories of the education column
In the preceding figure, you can see that basic.9y, basic.6y, and basic.4y are grouped together as Basic.
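The effect of the grouping on the category counts can be checked on a small Series. This is a sketch on made-up values, not the real column:

```python
import pandas as pd

# Toy education values (made up for illustration).
education = pd.Series(
    ["basic.4y", "basic.6y", "basic.9y", "high.school", "university.degree", "basic.9y"]
)

# Map the three basic levels onto a single label.
basic_levels = {"basic.4y": "Basic", "basic.6y": "Basic", "basic.9y": "Basic"}
grouped = education.replace(basic_levels)

print(education.nunique(), "categories before,", grouped.nunique(), "after")
print(grouped.value_counts())
```

replace() returns a new Series and leaves the original untouched, which is why the result has to be assigned to a name.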
Now we select and perform a suitable encoding method for the data. Add the following code to implement this:
import numpy as np
#Select all the non-numeric columns using the select_dtypes function
data_column_category = df.select_dtypes(exclude=[np.number]).columns
data_column_category
The preceding code generates the following output:
Figure 1.69: Various columns of the dataset
Now we take the names of all the categorical features in the data and loop through them, creating dummy-variable (one-hot) encoded columns for each one and joining them to the dataframe. Add the following code to do this:
cat_vars=data_column_category
for var in cat_vars:
    #one-hot encode the column; the new columns are named <var>_<category>
    cat_list = pd.get_dummies(df[var], prefix=var)
    df = df.join(cat_list)
df.columns
The preceding code generates the following output:
Figure 1.70: List of categorical features in the data
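To make the naming of the encoded columns concrete, here is a sketch of the same encode-and-join step for a single toy column. The data is made up for illustration:

```python
import pandas as pd

# Toy data (made up): one numeric and one categorical column.
toy = pd.DataFrame({"age": [30, 40, 50], "marital": ["single", "married", "single"]})

# One new column per category, named <prefix>_<category>.
dummies = pd.get_dummies(toy["marital"], prefix="marital")

# join() keeps the original column alongside the encoded ones.
encoded = toy.join(dummies)
print(encoded.columns.tolist())
print(encoded)
```

Note that the original marital column is still present after the join, which is why it has to be dropped in a separate step.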
Now we drop the original categorical columns that we have just encoded, keeping only the numerical and encoded categorical columns. Add the following code to do this:
#Categorical features
cat_vars=data_column_category
#All features
data_vars=df.columns.values.tolist()
#dropping the original categorical columns that we have already encoded
to_keep = [col for col in data_vars if col not in cat_vars]

#selecting only the numerical and encoded categorical columns
data_final=df[to_keep]
data_final.columns
The preceding code generates the following output:
Figure 1.71: List of numerical and encoded categorical columns
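As an alternative to the loop-and-filter approach above, pd.get_dummies() can encode several columns and drop the originals in one call through its columns parameter. The following sketch uses made-up data and is not the code used in this activity:

```python
import numpy as np
import pandas as pd

# Toy data (made up): two categorical and two numeric columns.
toy = pd.DataFrame({
    "age": [30, 40, 50],
    "marital": ["single", "married", "single"],
    "housing": ["yes", "no", "yes"],
    "y": [0, 1, 0],
})

# Names of the non-numeric columns, as in the activity.
cat_cols = toy.select_dtypes(exclude=[np.number]).columns.tolist()

# Encode the listed columns; the originals are removed from the result.
final = pd.get_dummies(toy, columns=cat_cols)
print(final.columns.tolist())
```

The result contains the same set of columns that the loop followed by the to_keep filter would produce for this toy data.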
Finally, we split the data into train and test sets. Add the following code to implement this:
#Segregating Independent and Target variable
X=data_final.drop(columns='y')
y=data_final['y']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("FULL Dataset X Shape: ", X.shape )
print("Train Dataset X Shape: ", X_train.shape )
print("Test Dataset X Shape: ", X_test.shape )
The output is as follows: