Data Science with Python

Combine Python with machine learning principles to discover hidden patterns in raw data

Product type: Paperback
Published: Jul 2019
Publisher: Packt
ISBN-13: 9781838552862
Length: 426 pages
Edition: 1st Edition
Authors (3): Rohan Chopra, Mohamed Noordeen Alaudeen, Aaron England
Table of Contents (10 chapters)

About the Book
1. Introduction to Data Science and Data Pre-Processing
2. Data Visualization
3. Introduction to Machine Learning via Scikit-Learn
4. Dimensionality Reduction and Unsupervised Learning
5. Mastering Structured Data
6. Decoding Images
7. Processing Human Language
8. Tips and Tricks of the Trade
Appendix

Chapter 1: Introduction to Data Science and Data Preprocessing

Activity 1: Pre-Processing Using the Bank Marketing Subscription Dataset

Solution

Let's perform various pre-processing tasks on the Bank Marketing Subscription dataset. We'll also be splitting the dataset into training and testing data. Follow these steps to complete this activity:

  1. Open a Jupyter notebook and add a new cell to import the pandas library and load the dataset into a pandas dataframe. To do so, you first need to import the library, and then use the pd.read_csv() function, as shown here:

    import pandas as pd

    #"?raw=true" makes GitHub serve the CSV file itself instead of the HTML page that wraps it
    Link = 'https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/Banking_Marketing.csv?raw=true'

    #reading the data into the dataframe object df
    df = pd.read_csv(Link, header=0)

  2. To find the number of rows and columns in the dataset, add the following code:

    #Finding number of rows and columns

    print("Number of rows and columns : ",df.shape)

    The preceding code generates the following output:

    Figure 1.60: Number of rows and columns in the dataset
  3. To print the list of all columns, add the following code:

    #Printing all the columns

    print(list(df.columns))

    The preceding code generates the following output:

    Figure 1.61: List of columns present in the dataset
  4. To get an overview of the basic statistics of each numeric column, such as the count, mean, median (the 50% row), standard deviation, minimum value, and maximum value, add the following code:

    #Basic Statistics of each column

    df.describe().transpose()

    The preceding code generates the following output:

    Figure 1.62: Basic statistics of each column
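
As a small illustration of what describe().transpose() returns, here is a minimal sketch on a hypothetical toy dataframe (the column values are made up and are not taken from the Bank Marketing dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for two numeric columns
toy = pd.DataFrame({"age": [25, 35, 45, 55], "duration": [100, 200, 300, 400]})

# describe() puts each feature in a column; transpose() turns each feature into a row
stats = toy.describe().transpose()

# The "50%" column holds the median of each feature
print(stats[["count", "mean", "50%", "min", "max"]])
```

Transposing is mainly a readability choice: with many features, one row per feature is easier to scan than one column per feature.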
  5. To print the basic information of each column, add the following code:

    #Basic Information of each column

    #info() prints its report directly and returns None, so no print() is needed
    df.info()

    The preceding code generates the following output:

    Figure 1.63: Basic information of each column

    In the preceding figure, you can see the number of non-null entries and the data type of each column.

  6. Now let's check for missing values and the type of each feature. Add the following code to do this:

    #finding the data types of each column and checking for null

    null_ = df.isna().any()

    dtypes = df.dtypes

    sum_na_ = df.isna().sum()

    info = pd.concat([null_,sum_na_,dtypes],axis = 1,keys = ['isNullExist','NullSum','type'])

    info

    Have a look at the output for this in the following figure:

    Figure 1.64: Information of each column stating the number of null values and the data types
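
To make the shape of this summary table concrete, the same pattern can be sketched on a hypothetical three-row dataframe (the values below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in "age" and one in "job"
toy = pd.DataFrame({
    "age": [30.0, np.nan, 41.0],
    "job": ["admin.", "technician", None],
    "duration": [120, 85, 300],
})

# One row per column: does it contain nulls, how many, and what is its dtype
summary = pd.concat(
    [toy.isna().any(), toy.isna().sum(), toy.dtypes],
    axis=1,
    keys=["isNullExist", "NullSum", "type"],
)
print(summary)
```

Because all three inputs are Series indexed by column name, pd.concat with axis=1 lines them up side by side without any extra merging step.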
  7. Now that we know which columns of the df object contain null values, we can remove the rows that hold them. To do so, add the following code:

    #removing Null values

    df = df.dropna()

    #Total number of null in each column

    print(df.isna().sum())# No NA

    Have a look at the output for this in the following figure:

    Figure 1.65: Features of dataset with no null values
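
The effect of dropna() on the row count can be seen in a minimal sketch on hypothetical toy data. The subset argument shown here is an optional alternative that is not used in the activity itself:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 lacks "age", row 2 lacks "job"
toy = pd.DataFrame({
    "age": [30.0, np.nan, 41.0, 52.0],
    "job": ["admin.", "technician", None, "services"],
})

# Default behaviour: drop every row that has at least one missing value
cleaned = toy.dropna()

# Alternative: drop only the rows where "age" is missing
age_known = toy.dropna(subset=["age"])

print(len(toy), len(cleaned), len(age_known))
```

Dropping rows is the simplest strategy, but it discards data, so it is worth comparing the row counts before and after.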
  8. Now we check the frequency distribution of the education column in the dataset. Use the value_counts() function to implement this:

    df.education.value_counts()

    Have a look at the output for this in the following figure:

    Figure 1.66: Frequency distribution of the education column
  9. In the preceding figure, we can see that the education column of the dataset has many categories. We need to reduce the categories for better modeling. To check the various categories in the education column, we use the unique() function. Type the following code to implement this:

    df.education.unique()

    The output is as follows:

    Figure 1.67: Various categories of the education column
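
The difference between value_counts() and unique() can be sketched on a hypothetical Series (the entries are illustrative and do not reproduce the real column):

```python
import pandas as pd

# Hypothetical stand-in for the education column
education = pd.Series(
    ["basic.4y", "university.degree", "basic.9y", "university.degree", "high.school"]
)

# value_counts() returns the frequency of each category, most frequent first
counts = education.value_counts()

# normalize=True returns proportions instead of raw counts
shares = education.value_counts(normalize=True)

# unique() returns the distinct categories in order of first appearance
categories = education.unique()

print(counts)
print(categories)
```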
  10. Now let's group the basic.4y, basic.9y, and basic.6y categories together and call them Basic. To do this, we can use the replace() method from pandas and assign the result back to the column:

    df['education'] = df['education'].replace({"basic.9y":"Basic","basic.6y":"Basic","basic.4y":"Basic"})

  11. To check the list of categories after grouping, add the following code:

    df.education.unique()

    Figure 1.68: Various categories of the education column

    In the preceding figure, you can see that basic.9y, basic.6y, and basic.4y are grouped together as Basic.
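
The grouping step can also be sketched on a hypothetical toy column. This variant passes a list of old labels and a single new label to replace(), which is equivalent to the dictionary form when every old label maps to the same value:

```python
import pandas as pd

# Hypothetical stand-in for the education column
toy = pd.DataFrame({"education": ["basic.4y", "basic.6y", "basic.9y", "high.school"]})

# All three basic.* levels collapse into the single label "Basic"
basic_levels = ["basic.4y", "basic.6y", "basic.9y"]
toy["education"] = toy["education"].replace(basic_levels, "Basic")

print(toy["education"].unique())
```

Assigning the result back to the column, rather than relying on an in-place change through attribute access, keeps the update explicit and avoids chained-assignment warnings in recent pandas versions.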

  12. Before encoding the data, we need to identify the categorical (non-numeric) columns. Add the following code to select and display them:

    import numpy as np

    #Select all the non numeric data using select_dtypes function
    data_column_category = df.select_dtypes(exclude=[np.number]).columns

    data_column_category

    The preceding code generates the following output:

    Figure 1.69: Various columns of the dataset
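
How select_dtypes() separates numeric from non-numeric columns can be sketched on a hypothetical toy dataframe (the column names mirror the dataset, but the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two numeric columns and one text column
toy = pd.DataFrame({"age": [30, 41], "job": ["admin.", "services"], "y": [0, 1]})

# exclude=[np.number] keeps only the non-numeric (categorical) columns
categorical_cols = toy.select_dtypes(exclude=[np.number]).columns

# include=[np.number] keeps only the numeric columns
numeric_cols = toy.select_dtypes(include=[np.number]).columns

print(list(categorical_cols), list(numeric_cols))
```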
  13. Now we take the list of categorical feature names and loop through it, creating dummy (one-hot encoded) columns for each feature and joining them to the dataframe. Add the following code to do this:

    cat_vars=data_column_category

    for var in cat_vars:
        #one-hot encode the column; the new columns are named <var>_<category>
        cat_list = pd.get_dummies(df[var], prefix=var)
        #append the dummy columns to the dataframe
        df=df.join(cat_list)

    df.columns

    The preceding code generates the following output:

    Figure 1.70: List of categorical features in the data
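
What a single pass of this loop does can be sketched on a hypothetical toy dataframe with one categorical column:

```python
import pandas as pd

# Hypothetical toy data with one categorical column
toy = pd.DataFrame({"age": [30, 41, 25], "marital": ["married", "single", "married"]})

# One indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(toy["marital"], prefix="marital")

# join() appends the indicator columns; the original column is still present
encoded = toy.join(dummies)

print(list(encoded.columns))
```

Note that the original marital column survives the join, which is why the encoded source columns still have to be removed afterwards.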
  14. Now we drop the original categorical columns that we have already encoded, keeping only the numerical and encoded categorical columns. Add the following code to do this:

    #Categorical features
    cat_vars=data_column_category

    #All features
    data_vars=df.columns.values.tolist()

    #skipping the original categorical columns for which we have done encoding
    to_keep = []
    for i in data_vars:
        if i not in cat_vars:
            to_keep.append(i)

    #selecting only the numerical and encoded categorical columns
    data_final=df[to_keep]

    data_final.columns

    The preceding code generates the following output:

    Figure 1.71: List of numerical and encoded categorical columns
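
As an alternative to encoding in a loop and then filtering the columns, pd.get_dummies() can encode and drop the source columns in one call through its columns argument. The following is a minimal sketch on hypothetical toy data, not the approach used in the activity:

```python
import pandas as pd

# Hypothetical toy data with one categorical column
toy = pd.DataFrame({"age": [30, 41, 25], "marital": ["married", "single", "married"]})

# Columns listed in `columns` are encoded and removed from the result
final = pd.get_dummies(toy, columns=["marital"])

print(list(final.columns))
```

The explicit loop makes each stage visible, which suits a step-by-step activity; the one-call form is shorter once the idea is familiar.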
  15. Finally, we split the data into train and test sets. Add the following code to implement this:

    #Segregating Independent and Target variable

    X=data_final.drop(columns='y')

    y=data_final['y']

    from sklearn.model_selection import train_test_split

    #80% of the rows go to the training set and 20% to the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    print("Full Dataset X Shape: ", X.shape )

    print("Train Dataset X Shape: ", X_train.shape )

    print("Test Dataset X Shape: ", X_test.shape )

    The output is as follows:

    Figure 1.72: Shape of the full, train, and test datasets
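
How test_size determines the resulting shapes can be sketched on hypothetical toy arrays. The stratify argument used here is an optional extra, not part of the activity; it keeps the class proportions the same in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, balanced binary target
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 sends 2 of the 10 rows to the test set;
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```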