Exploratory Data Analysis with Python Cookbook: Over 50 recipes to analyze, visualize, and extract insights from structured and unstructured data

Preparing Data for EDA

Before exploring and analyzing tabular data, we sometimes will be required to prepare the data for analysis. This preparation can come in the form of data transformation, aggregation, or cleanup. In Python, the pandas library helps us to achieve this through several modules. The preparation steps for tabular data are never a one-size-fits-all approach. They are typically determined by the structure of our data, that is, the rows, columns, data types, and data values.

In this chapter, we will focus on common data preparation techniques required to prepare our data for EDA:

  • Grouping data
  • Appending data
  • Concatenating data
  • Merging data
  • Sorting data
  • Categorizing data
  • Removing duplicate data
  • Dropping data rows and columns
  • Replacing data
  • Changing a data format
  • Dealing with missing values

Technical requirements

We will leverage the pandas library in Python for this chapter. The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Exploratory-Data-Analysis-with-Python-Cookbook.

Grouping data

When we group data, we are aggregating the data by category. This can be very useful, especially when we need a high-level view of a detailed dataset. To group a dataset, we typically need to identify the column/category to group by, the column to aggregate, and the specific aggregation to apply. The column to group by is usually categorical, while the column to aggregate is usually numeric. The aggregation can be a count, sum, minimum, maximum, and so on. We can also perform an aggregation such as a count directly on the categorical column we group by.

In pandas, the groupby method helps us group data.

Getting ready

We will work with one dataset in this chapter – the Marketing Campaign data from Kaggle.

Create a folder for this chapter and create a new Python script or Jupyter notebook file in that folder. Create a data subfolder and place the marketing_campaign.csv file in that subfolder. Alternatively, you can retrieve all the files from the GitHub repository.

Note

Kaggle provides the Marketing Campaign data for public use at https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis. In this chapter, we use both the full dataset and samples of the dataset for the different recipes. The data is also available in the repository. The data in Kaggle appears in a single-column format, but the data in the repository was transformed into a multiple-column format for easy usage in pandas.

How to do it…

We will learn how to group data using the pandas library:

  1. Import the pandas library:
    import pandas as pd
  2. Load the .csv file into a dataframe using read_csv. Then, subset the dataframe to include only relevant columns:
    marketing_data = pd.read_csv("data/marketing_campaign.csv")
    marketing_data = marketing_data[['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome', 'Dt_Customer', 'Recency', 'NumStorePurchases', 'NumWebVisitsMonth']]
  3. Inspect the data. Check the first few rows and use transpose (T) to show more information. Also, check the data types as well as the number of columns and rows:
    marketing_data.head(2).T
                0    1
    ID    5524    2174
    Year_Birth    1957    1954
    Education    Graduation    Graduation
    …        …        …
    NumWebVisitsMonth    7    5
    marketing_data.dtypes
    ID    int64
    Year_Birth    int64
    Education    object
    …           …
    NumWebVisitsMonth    int64
    marketing_data.shape
    (2240, 11)
  4. Use the groupby method in pandas to get the average number of store purchases of customers based on the number of kids at home:
    marketing_data.groupby('Kidhome')['NumStorePurchases'].mean()
    Kidhome
    0    7.2173240525908735
    1    3.863181312569522
    2    3.4375

That’s all. Now, we have grouped our dataset.

How it works...

All of the recipes in this chapter use the pandas library for data transformation and manipulation. We import pandas and refer to it as pd in step 1. In step 2, we use read_csv to load the .csv file into a pandas dataframe and call it marketing_data. We also subset the dataframe to include only 11 relevant columns. In step 3, we inspect the dataset using head(2) to see the first two rows; we also use transpose (T) along with head to turn the rows into columns, because the dataframe has many columns and is easier to read that way. We use the dtypes attribute of the dataframe to show the data types of all columns. Numeric data has int and float data types while character data has the object data type. We inspect the number of rows and columns using shape, which returns a tuple with the number of rows as the first element and the number of columns as the second element.

In step 4, we apply the groupby method to get the average number of store purchases of customers based on the number of kids at home. Using the groupby method, we group by Kidhome, then we aggregate by NumStorePurchases, and finally, we use the mean method as the specific aggregation to be performed on NumStorePurchases.

There’s more...

Using the groupby method in pandas, we can group by multiple columns by passing the column names in a Python list. Beyond the mean, several other aggregation methods can be applied, such as max, min, and median. In addition, the agg method can apply one or several aggregations at once; it accepts aggregation names as strings (such as 'sum'), functions such as those from numpy, or custom functions. Custom functions can also be applied to each group through the apply method, while the transform method returns a result with the same index as the input.
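
The options above can be sketched on a small dataframe. The toy data and the variable names below are hypothetical stand-ins for the marketing dataset, not part of the recipe:

```python
import pandas as pd

# Hypothetical toy data standing in for the marketing dataset
toy = pd.DataFrame({
    "Kidhome": [0, 0, 1, 1, 2],
    "Teenhome": [0, 1, 0, 0, 1],
    "NumStorePurchases": [8, 6, 4, 2, 3],
})

# Group by two columns by passing them as a list
by_two = toy.groupby(["Kidhome", "Teenhome"])["NumStorePurchases"].mean()

# agg can take several aggregation names as strings at once
summary = toy.groupby("Kidhome")["NumStorePurchases"].agg(["min", "max", "median"])

# A custom aggregation: the range (max minus min) of purchases per group
purchase_range = toy.groupby("Kidhome")["NumStorePurchases"].agg(
    lambda s: s.max() - s.min()
)

print(by_two)
print(summary)
print(purchase_range)
```

Grouping by a list of columns produces a result indexed by both columns, so a single group is looked up with a tuple such as by_two.loc[(1, 0)].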

See also

Here is an insightful article by Dataquest on the groupby method in pandas: https://www.dataquest.io/blog/grouping-data-a-step-by-step-tutorial-to-groupby-in-pandas/.

Appending data

Sometimes, we may be analyzing multiple datasets that have a similar structure or samples of the same dataset. While analyzing our datasets, we may need to append them together into a new single dataset. When we append datasets, we stitch the datasets along the rows. For example, if we have 2 datasets containing 1,000 rows and 20 columns each, the appended data will contain 2,000 rows and 20 columns. The rows typically increase while the columns remain the same. The datasets are allowed to have a different number of rows but typically should have the same columns; where a column is missing from one dataset, pandas fills its values with NaN in the appended result.

In pandas, the concat method helps us append data.

Getting ready

We will continue working with the Marketing Campaign data from Kaggle. We will work with two samples of that dataset.

Place the marketing_campaign_append1.csv and marketing_campaign_append2.csv files in the data subfolder created in the first recipe. Alternatively, you could retrieve all the files from the GitHub repository.

How to do it…

We will explore how to append data using the pandas library:

  1. Import the pandas library:
    import pandas as pd
  2. Load the .csv files into dataframes using read_csv. Then, subset the dataframes to include only relevant columns:
    marketing_sample1 = pd.read_csv("data/marketing_campaign_append1.csv")
    marketing_sample2 = pd.read_csv("data/marketing_campaign_append2.csv")
    marketing_sample1 = marketing_sample1[['ID', 'Year_Birth','Education','Marital_Status','Income', 'Kidhome','Teenhome','Dt_Customer', 'Recency','NumStorePurchases', 'NumWebVisitsMonth']]
    marketing_sample2 = marketing_sample2[['ID', 'Year_Birth','Education','Marital_Status','Income', 'Kidhome','Teenhome','Dt_Customer', 'Recency','NumStorePurchases', 'NumWebVisitsMonth']]
  3. Take a look at the two datasets. Check the first few rows and use transpose (T) to show more information:
    marketing_sample1.head(2).T
        0    1
    ID    5524    2174
    Year_Birth    1957    1954
    …        …        …
    NumWebVisitsMonth    7    5
    marketing_sample2.head(2).T
        0    1
    ID    9135    466
    Year_Birth    1950    1944
    …        …        …
    NumWebVisitsMonth    8    2
  4. Check the data types as well as the number of columns and rows:
    marketing_sample1.dtypes
    ID    int64
    Year_Birth    int64
    …          …
    NumWebVisitsMonth    int64
    marketing_sample2.dtypes
    ID    int64
    Year_Birth    int64
    …          …
    NumWebVisitsMonth    int64
    marketing_sample1.shape
    (500, 11)
    marketing_sample2.shape
    (500, 11)
  5. Append the datasets. Use the concat method from the pandas library to append the data:
    appended_data = pd.concat([marketing_sample1, marketing_sample2])
  6. Inspect the shape of the result and the first few rows:
    appended_data.head(2).T
        0    1
    ID    5524    2174
    Year_Birth    1957    1954
    Education    Graduation    Graduation
    Marital_Status    Single    Single
    Income    58138.0    46344.0
    Kidhome    0    1
    Teenhome    0    1
    Dt_Customer    04/09/2012    08/03/2014
    Recency    58    38
    NumStorePurchases    4    2
    NumWebVisitsMonth    7    5
    appended_data.shape
    (1000, 11)

Well done! We have appended our datasets.

How it works...

We import the pandas library and refer to it as pd in step 1. In step 2, we use read_csv to load the two .csv files to be appended into pandas dataframes. We call the dataframes marketing_sample1 and marketing_sample2 respectively. We also subset the dataframes to include only 11 relevant columns. In step 3, we inspect each dataset using head(2) to see the first two rows; we also use transpose (T) along with head to turn the rows into columns, because the dataframes have many columns. In step 4, we use the dtypes attribute of the dataframe to show the data types of all columns. Numeric data has int and float data types while character data has the object data type. We inspect the number of rows and columns using shape, which returns a tuple that displays the number of rows and columns respectively.

In step 5, we apply the concat method to append the two datasets. The method takes in the list of dataframes as an argument. The list is the only argument required because the default setting of the concat method is to append data. In step 6, we inspect the first few rows of the output and its shape.

There’s more...

Using the concat method in pandas, we can append more than two datasets. All that is required is to include these datasets in the list, and then they will be appended. It is important to note that the datasets should have the same columns; otherwise, the values for any missing column are filled with NaN.
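
As a minimal sketch of appending more than two datasets, the example below stacks three small dataframes. The data and names are hypothetical, and part3 deliberately lacks the Income column to show how a missing column is handled:

```python
import pandas as pd

# Hypothetical samples; part3 deliberately lacks the Income column
part1 = pd.DataFrame({"ID": [1, 2], "Income": [50000.0, 42000.0]})
part2 = pd.DataFrame({"ID": [3, 4], "Income": [61000.0, 38000.0]})
part3 = pd.DataFrame({"ID": [5]})

# Append three dataframes; ignore_index=True renumbers the rows 0..n-1
stacked = pd.concat([part1, part2, part3], ignore_index=True)

print(stacked)
print(stacked.shape)
```

Without ignore_index=True, each dataframe keeps its original row labels, so the result would contain repeated index values.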

Concatenating data

Sometimes, we may need to stitch multiple datasets or samples of the same dataset by columns and not rows. This is where we concatenate our data. While appending stitches rows of data together, concatenating stitches columns together to provide a single dataset. For example, if we have 2 datasets containing 1,000 rows and 20 columns each, the concatenated data will contain 1,000 rows and 40 columns. The columns typically increase while the rows remain the same. The datasets are allowed to have a different number of columns but typically should have the same number of rows and matching row indexes; pandas aligns the rows by index and fills any unmatched positions with NaN.

In pandas, the concat method helps us concatenate data.

Getting ready

We will continue working with the Marketing Campaign data from Kaggle. We will work with two samples of that dataset.

Place the marketing_campaign_concat1.csv and marketing_campaign_concat2.csv files in the data subfolder created in the first recipe. Alternatively, you can retrieve all the files from the GitHub repository.

How to do it…

We will explore how to concatenate data using the pandas library:

  1. Import the pandas library:
    import pandas as pd
  2. Load the .csv files into dataframes using read_csv:
    marketing_sample1 = pd.read_csv("data/marketing_campaign_concat1.csv")
    marketing_sample2 = pd.read_csv("data/marketing_campaign_concat2.csv")
  3. Take a look at the two datasets. Check the first few rows and use transpose (T) to show more information:
    marketing_sample1.head(2).T
        0    1
    ID    5524    2174
    Year_Birth    1957    1954
    Education    Graduation    Graduation
    Marital_Status    Single    Single
    Income    58138.0    46344.0
    marketing_sample2.head(2).T
        0    1
    NumDealsPurchases    3    2
    NumWebPurchases    8    1
    NumCatalogPurchases    10    1
    NumStorePurchases    4    2
    NumWebVisitsMonth    7    5
  4. Check the data types as well as the number of columns and rows:
    marketing_sample1.dtypes
    ID    int64
    Year_Birth    int64
    Education    object
    Marital_Status    object
    Income    float64
    marketing_sample2.dtypes
    NumDealsPurchases    int64
    NumWebPurchases         int64
    NumCatalogPurchases    int64
    NumStorePurchases    int64
    NumWebVisitsMonth     int64
    marketing_sample1.shape
    (2240, 5)
    marketing_sample2.shape
    (2240, 5)
  5. Concatenate the datasets. Use the concat method from the pandas library to concatenate the data:
    concatenated_data = pd.concat([marketing_sample1, marketing_sample2], axis = 1)
  6. Inspect the shape of the result and the first few rows:
    concatenated_data.head(2).T
        0    1
    ID    5524    2174
    Year_Birth    1957    1954
    Education    Graduation    Graduation
    Marital_Status    Single    Single
    Income    58138.0    46344.0
    NumDealsPurchases    3    2
    NumWebPurchases    8    1
    NumCatalogPurchases    10    1
    NumStorePurchases    4    2
    NumWebVisitsMonth    7    5
    concatenated_data.shape
    (2240, 10)

Awesome! We have concatenated our datasets.

How it works...

We import the pandas library and refer to it as pd in step 1. In step 2, we use read_csv to load the two .csv files to be concatenated into pandas dataframes. We call the dataframes marketing_sample1 and marketing_sample2 respectively. In step 3, we inspect each dataset using head(2) to see the first two rows; we also use transpose (T) along with head to turn the rows into columns for easier reading. In step 4, we use the dtypes attribute of the dataframe to show the data types of all columns. Numeric data has int and float data types while character data has the object data type. We inspect the number of rows and columns using shape, which returns a tuple that displays the number of rows and columns respectively.

In step 5, we apply the concat method to concatenate the two datasets. Just like when appending, the method takes in the list of dataframes as an argument. However, it takes an additional argument for the axis parameter. The value 1 indicates that the axis refers to columns. The default value is 0, which refers to rows and is what we rely on when appending datasets. In step 6, we check the first few rows of the output as well as the shape.

There’s more...

Using the concat method in pandas, we can concatenate more than two datasets. Just like appending, all that is required is to include these datasets in the list and set the axis value to 1. It is important to note that the datasets should have the same number of rows and matching row indexes, because pandas aligns the rows by index and fills any unmatched positions with NaN.
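
A minimal sketch of concatenating three dataframes by columns follows. The data and names are hypothetical, and the second dataframe is one row short to show how rows are aligned by index:

```python
import pandas as pd

# Hypothetical column-wise samples; "right" has one row fewer than the others
left = pd.DataFrame({"ID": [10, 11, 12]})
right = pd.DataFrame({"NumWebPurchases": [8, 1]})
extra = pd.DataFrame({"NumStorePurchases": [4, 2, 6]})

# axis=1 places the dataframes side by side, matching rows by index label
wide = pd.concat([left, right, extra], axis=1)

print(wide)
print(wide.shape)
```

The row with index 2 has no counterpart in the shorter dataframe, so its NumWebPurchases value is NaN rather than raising an error.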

See also

You can read this insightful article by Dataquest on concatenation: https://www.dataquest.io/blog/pandas-concatenation-tutorial/.

Merging data

Merging sounds a bit like concatenating our dataset; however, it is quite different. To merge datasets, we need to have a common field in both datasets on which we can perform a merge.

If you are familiar with SQL join commands, then you are probably familiar with merging data. Usually, data from relational databases will require merging operations. Relational databases typically contain tabular data and account for a significant proportion of data found in many organizations. Some key concepts to note when doing merge operations include the following:

  • Join key column: This refers to the common column within both datasets in which there are matching values. This is typically used to join the datasets. The columns do not need to have the same name; they only need to have matching values within the two datasets.
  • Type of join: There are different types of join operations that can be performed on datasets:
    • Left join: We retain all the rows in the left dataframe. Rows in the left dataframe with no matching key in the right dataframe get empty/Not a Number (NaN) values in the columns that come from the right dataframe. Rows in the right dataframe with no match are dropped. The matching is done on the join key column.
    • Right join: We retain all the rows in the right dataframe. Rows in the right dataframe with no matching key in the left dataframe get empty/NaN values in the columns that come from the left dataframe. Rows in the left dataframe with no match are dropped. The matching is done on the join key column.
    • Inner join: We retain only the rows whose key values appear in both the left and right dataframes, so the join itself introduces no empty/NaN values.
    • Outer join/full outer join: We retain all the rows from both the left and right dataframes. Where a row has no match in the other dataframe, NaN is added for the columns that come from that dataframe.

Figure 2.1 – Venn diagrams illustrating different types of joins

In pandas, the merge method helps us to merge dataframes.
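
To make the join types concrete, here is a minimal sketch on two hypothetical tables. The names profile and income and their contents are illustrative only; IDs 1 and 4 each appear in just one table:

```python
import pandas as pd

# Hypothetical tables sharing an ID key; IDs 1 and 4 appear in only one table
profile = pd.DataFrame({"ID": [1, 2, 3], "Education": ["PhD", "Master", "Graduation"]})
income = pd.DataFrame({"ID": [2, 3, 4], "Income": [46344.0, 71613.0, 26646.0]})

# Run each join type and record how many rows it returns
row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    joined = pd.merge(profile, income, on="ID", how=how)
    row_counts[how] = len(joined)

# In a left join, ID 1 is kept but has no Income value
left_joined = pd.merge(profile, income, on="ID", how="left")

print(row_counts)
print(left_joined)
```

The inner join returns the fewest rows because it keeps only IDs 2 and 3, while the outer join returns one row for each of the four distinct IDs.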

Getting ready

We will continue working with the Marketing Campaign data from Kaggle. We will work with two samples of that dataset.

Place the marketing_campaign_merge1.csv and marketing_campaign_merge2.csv files in the data subfolder created in the first recipe. Alternatively, you can retrieve all the files from the GitHub repository.

How to do it…

We will merge datasets using the pandas library:

  1. Import the pandas library:
    import pandas as pd
  2. Load the .csv files into dataframes using read_csv:
    marketing_sample1 = pd.read_csv("data/marketing_campaign_merge1.csv")
    marketing_sample2 = pd.read_csv("data/marketing_campaign_merge2.csv")
  3. Take a look at the two datasets. Check the first few rows through the head method. Also, check the number of columns and rows:
    marketing_sample1.head()
          ID  Year_Birth  Education
    0    5524  1957     Graduation
    1    2174  1954     Graduation
    2    4141  1965     Graduation
    3    6182  1984     Graduation
    4    5324  1981      PhD
    marketing_sample2.head()
        ID    Marital_Status    Income
    0    5524    Single    58138.0
    1    2174    Single    46344.0
    2    4141    Together    71613.0
    3    6182    Together    26646.0
    4    5324    Married    58293.0
    marketing_sample1.shape
    (2240, 3)
    marketing_sample2.shape
    (2240, 3)
  4. Merge the datasets. Use the merge method from the pandas library to merge the datasets:
    merged_data = pd.merge(marketing_sample1,marketing_sample2,on = "ID")
  5. Inspect the shape of the result and the first few rows:
    merged_data.head()
        ID    Year_Birth    Education    Marital_Status    Income
    0    5524    1957    Graduation    Single    58138.0
    1    2174    1954    Graduation    Single    46344.0
    2    4141    1965    Graduation    Together    71613.0
    3    6182    1984    Graduation    Together    26646.0
    4    5324    1981    PhD    Married    58293.0
    merged_data.shape
    (2240, 5)

Great! We have merged our dataset.

How it works...

We import the pandas library and refer to it as pd in step 1. In step 2, we use read_csv to load the two .csv files to be merged into pandas dataframes. We call the dataframes marketing_sample1 and marketing_sample2 respectively. In step 3, we inspect the dataset using head() to see the first five rows in the dataset. We inspect the number of rows and columns using shape, which returns a tuple that displays the number of rows and columns respectively.

In step 4, we apply the merge method to merge the two datasets. We provide three arguments: the first two are the dataframes we want to merge, and the third, on, specifies the key or common column on which the merge is performed. The merge method also has a how parameter, which specifies the type of join to be used. Its default value is inner, which performs an inner join. In step 5, we inspect the first five rows of the result and its shape.

There’s more...

Sometimes, the common field in two datasets may have a different name. The merge method allows us to address this through two arguments, left_on and right_on. left_on specifies the key on the left dataframe, while right_on is the same thing on the right dataframe.
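
A minimal sketch of merging on differently named keys is shown below. The column name CustomerID and the two small tables are hypothetical and used only for illustration:

```python
import pandas as pd

# Hypothetical tables whose key columns have different names
customers = pd.DataFrame({"ID": [5524, 2174], "Education": ["Graduation", "Graduation"]})
incomes = pd.DataFrame({"CustomerID": [5524, 2174], "Income": [58138.0, 46344.0]})

# Name the key column on each side explicitly
merged = pd.merge(customers, incomes, left_on="ID", right_on="CustomerID")

# Both key columns are kept in the result, so drop the redundant one
merged = merged.drop(columns="CustomerID")

print(merged)
```

Dropping the second key column is a design choice here; it can be kept if both names are needed downstream.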

See also

You can check out this useful resource by Real Python on merging data in pandas: https://realpython.com/pandas-merge-join-and-concat/.

Sorting data

When we sort data, we arrange it in a specific sequence. This specific sequence typically helps us to spot patterns very quickly. To sort a dataset, we usually must specify one or more columns to sort by and specify the order to sort by (ascending or descending order).

In pandas, the sort_values method can be used to sort a dataset.

Getting ready

We will work with the Marketing Campaign data (https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis) for this recipe. Alternatively, you can retrieve this from the GitHub repository.

How to do it…

We will sort data using the pandas library:

  1. Import the pandas library:
    import pandas as pd
  2. Load the .csv file into a dataframe using read_csv. Then, subset the dataframe to include only relevant columns:
    marketing_data = pd.read_csv("data/marketing_campaign.csv")
    marketing_data = marketing_data[['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome', 'Dt_Customer', 'Recency', 'NumStorePurchases', 'NumWebVisitsMonth']]
  3. Inspect the data. Check the first few rows and use transpose (T) to show more information. Also, check the data types as well as the number of columns and rows:
    marketing_data.head(2).T
                0    1
    ID    5524    2174
    Year_Birth    1957    1954
    Education    Graduation    Graduation
    …        …        …
    NumWebVisitsMonth    7    5
    marketing_data.dtypes
    ID    int64
    Year_Birth    int64
    Education    object
    …          …
    NumWebVisitsMonth    int64
    marketing_data.shape
    (2240, 11)
  4. Sort customers based on the number of store purchases in descending order:
    sorted_data = marketing_data.sort_values('NumStorePurchases', ascending=False)
  5. Inspect the result. Subset for relevant columns:
    sorted_data[['ID','NumStorePurchases']]
        ID    NumStorePurchases
    1187    9855    13
    803    9930    13
    1144    819    13
    286    10983    13
    1150    1453    13
     ...     ...    ...
    164    8475    0
    2214    9303    0
    27    5255    0
    1042    10749    0
    2132    11181    0

Great! We have sorted our dataset.

How it works...

We refer to pandas as pd in step 1. In step 2, we use read_csv to load the .csv file into a pandas dataframe and call it marketing_data. We also subset the dataframe to include only 11 relevant columns. In step 3, we inspect the dataset using head(2) to see the first two rows in the dataset; we also use transpose (T) along with head to transform the rows into columns due to the size of the data (i.e., it has many columns). We use the dtypes attribute of the dataframe to show the data types of all columns. Numeric data has int and float data types while character data has the object data type. We inspect the number of rows and columns using shape, which returns a tuple that displays the number of rows as the first element and the number of columns as the second element.

In step 4, we apply the sort_values method to sort the dataframe by the NumStorePurchases column in descending order. We pass two arguments: the column to sort by and the sort order, through the ascending parameter. False indicates a sort in descending order, while True (the default) indicates a sort in ascending order. In step 5, we subset for the ID and NumStorePurchases columns and inspect the result.

There’s more...

Sorting can be done across multiple columns in pandas. We can sort based on multiple columns by supplying columns as a list in the sort_values method. The sort will be performed in the order in which the columns are supplied – that is, column 1 first, then column 2 next, and subsequent columns. Also, a sort isn’t limited to numerical columns alone; it can be used for columns containing characters.
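
A multi-column sort can be sketched as follows, using hypothetical toy data. The ascending parameter also accepts a list, one entry per column, which is an option not covered in the recipe above:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Education": ["PhD", "Graduation", "PhD", "Graduation"],
    "NumStorePurchases": [4, 13, 10, 2],
})

# Sort by Education (A to Z), then by purchases (highest first) within each level
ordered = toy.sort_values(["Education", "NumStorePurchases"], ascending=[True, False])

print(ordered)
```

The original row labels are preserved after sorting, which is why the index of the result is no longer in numeric order.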

Categorizing data

When we refer to categorizing data, we are specifically referring to binning, bucketing, or cutting a dataset. Binning involves grouping the numeric values in a dataset into smaller intervals called bins or buckets. When we bin numerical values, each bin becomes a categorical value. Bins are very useful because they can provide us with insights that may have been difficult to spot if we had worked directly with individual numerical values. Bins don’t always have equal intervals; the creation of bins is dependent on our understanding of a dataset.

Binning can also be used to address outliers or reduce the effect of observation errors. Outliers are unusually high or unusually low data points that are far from other data points in our dataset. They typically lead to anomalies in the output of our analysis. Binning can reduce this effect by placing the range of numerical values including the outliers into specific buckets, thereby making the values categorical. A common example of this is when we convert age values into age groups. Outlier ages such as 0 or 150 can fall into a less than age 18 bin and greater than age 80 bin respectively.

In pandas, the cut method can be used to bin a dataset.

Getting ready

We will work with the full Marketing Campaign data for this recipe.

How to do it…

We will categorize data using the pandas library:

  1. Import the pandas library:
    import pandas as pd
  2. Load the .csv file into a dataframe using read_csv. Then, subset the dataframe to include only relevant columns:
    marketing_data = pd.read_csv("data/marketing_campaign.csv")
    marketing_data = marketing_data[['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome', 'Dt_Customer', 'Recency', 'NumStorePurchases', 'NumWebVisitsMonth']]
  3. Inspect the data. Check the first few rows and use transpose (T) to show more information. Also, check the data types as well as the number of columns and rows:
    marketing_data.head(2).T
                0    1
    ID    5524    2174
    Year_Birth    1957    1954
    Education    Graduation    Graduation
    …        …        …
    NumWebVisitsMonth    7    5
    marketing_data.dtypes
    ID    int64
    Year_Birth    int64
    Education    object
    …          …
    NumWebVisitsMonth    int64
    marketing_data.shape
    (2240, 11)
  4. Categorize the number of store purchases into high, moderate, and low categories:
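    # include_lowest=True places 0 in the Low bin; by default, the first interval excludes its left edge
    marketing_data['bins'] = pd.cut(x=marketing_data['NumStorePurchases'], bins=[0, 4, 8, 13], labels=['Low', 'Moderate', 'High'], include_lowest=True)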
  5. Inspect the result. Subset for relevant columns:
    marketing_data[['NumStorePurchases','bins']].head()
        NumStorePurchases    bins
    0    4    Low
    1    2    Low
    2    10    High
    3    4    Low
    4    6    Moderate

We have now categorized our dataset into bins.

How it works...

We refer to pandas as pd in step 1. In step 2, we use read_csv to load the .csv file into a pandas dataframe and call it marketing_data. We also subset the dataframe to include only 11 relevant columns. In step 3, we inspect the dataset using head(2) to see the first two rows in the dataset; we also use transpose (T) along with head to transform the rows into columns due to the size of the data (i.e., it has many columns). We use the dtypes attribute of the dataframe to show the data types of all columns. Numeric data has int and float data types while character data has the object data type. We inspect the number of rows and columns using shape, which returns a tuple that displays the number of rows and columns respectively.

In step 4, we categorize the number of store purchases into three categories, namely High, Moderate, and Low. Using the cut method, we cut NumStorePurchases into these three bins and supply the bin edges through the bins parameter. The labels parameter supplies the name of each bin. Whenever we supply the bin edges manually in a list, as done here, the number of edges must be one more than the number of labels.

Our bins can be interpreted as 0–4 (low), 5–8 (moderate), and 9–13 (high). In step 5, we subset for relevant columns and inspect the result of our binning.
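
By default, cut builds intervals that are closed on the right, so with the edges [0, 4, 8, 13] a value of exactly 0 belongs to a bin only when include_lowest=True is passed. A minimal sketch on hypothetical values illustrates the difference:

```python
import pandas as pd

purchases = pd.Series([0, 4, 5, 8, 9, 13])  # hypothetical values
edges = [0, 4, 8, 13]
labels = ["Low", "Moderate", "High"]

# Default: intervals are (0, 4], (4, 8], (8, 13], so 0 falls outside every bin
default_bins = pd.cut(purchases, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 becomes "Low"
inclusive_bins = pd.cut(purchases, bins=edges, labels=labels, include_lowest=True)

print(default_bins.tolist())
print(inclusive_bins.tolist())
```

Values that fall outside every bin are returned as NaN rather than raising an error, so it is worth checking for missing values after binning.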

There’s more...

For the bins argument, we can also supply the number of bins we require instead of supplying the bin edges manually. This means that in the previous steps, we could have supplied the value 3 to the bins parameter and the cut method would have categorized our data into three equal-width bins spanning the range of the data. When the value 3 is supplied, the cut method focuses on the equal spacing of the bins, even though the number of records in the bins may be different.

If we are also interested in the distribution of our bins and not just equally spaced bins or user-defined bins, the qcut method in pandas can be used. The qcut method ensures the distribution of data in the bins is equal. It ensures all bins have (roughly) the same number of observations, even though the bin range may vary.
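
The contrast between cut with a bin count and qcut can be sketched on a small, deliberately skewed series. The values are hypothetical and chosen so that one large value stretches the range:

```python
import pandas as pd

# Hypothetical skewed values: most are small, one is large
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
labels = ["Low", "Moderate", "High"]

# cut with an integer: three bins of equal width over the observed range
equal_width = pd.cut(values, bins=3, labels=labels)

# qcut: three bins holding roughly equal numbers of observations
equal_count = pd.qcut(values, q=3, labels=labels)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With equal-width bins, the single large value leaves the middle bin empty, whereas qcut places three observations in each bin.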

Removing duplicate data

Duplicate data can be very misleading and can lead us to wrong conclusions about patterns and the distribution of our data. Therefore, it is very important to address duplicate data within our dataset before embarking on any analysis. Performing a quick duplicate check is good practice in EDA. When working with tabular datasets, we can identify duplicate values in specific columns or duplicate records (across multiple columns). A good understanding of our dataset and the domain will give us insight into what should be considered a duplicate. In pandas, the drop_duplicates method can help us with handling duplicate values or records within our dataset.

Getting ready

We will work with the full Marketing Campaign data for this recipe.

How to do it…

We will remove duplicate data using the pandas library:

  1. Import the pandas library:
    import pandas as pd
  2. Load the .csv file into a dataframe using read_csv. Then, subset the dataframe to include only relevant columns:
    marketing_data = pd.read_csv("data/marketing_campaign.csv")
    marketing_data = marketing_data[['Education','Marital_Status','Kidhome', 'Teenhome']]
  3. Inspect the data. Check the first few rows. Also, check the number of columns and rows:
    marketing_data.head()
            Education    Marital_Status    Kidhome    Teenhome
    0    Graduation    Single    0    0
    1    Graduation    Single    1    1
    2    Graduation    Together    0    0
    3    Graduation    Together    1    0
    4    PhD    Married    1    0
    marketing_data.shape
    (2240, 4)
  4. Remove duplicates across the four columns in our dataset:
    marketing_data_duplicate = marketing_data.drop_duplicates()
  5. Inspect the result:
    marketing_data_duplicate.head()
        Education    Marital_Status    Kidhome    Teenhome
    0    Graduation    Single    0    0
    1    Graduation    Single    1    1
    2    Graduation    Together    0    0
    3    Graduation    Together    1    0
    4    PhD    Married    1    0
    marketing_data_duplicate.shape
    (135, 4)

We have now removed duplicates from our dataset.

How it works...

We refer to pandas as pd in step 1. In step 2, we use read_csv to load the .csv file into a pandas dataframe and call it marketing_data. We also subset the dataframe to include only four relevant columns. In step 3, we inspect the dataset using head() to see the first five rows in the dataset. Using the shape attribute, we get the number of rows and columns, in that order, as a tuple.

In step 4, we use the drop_duplicates method to remove rows that are duplicated across all four columns of our dataset. We save the result in the marketing_data_duplicate variable. In step 5, we inspect the result using the head method to see the first five rows. We also use the shape attribute to inspect the number of rows and columns. We can see that the number of rows has decreased significantly, from 2,240 to 135.
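Before dropping anything, it can help to count how many rows would be affected. The following is a minimal sketch using the duplicated method on a small hypothetical dataframe; the rows are made up for illustration and only borrow the recipe's column names.

```python
import pandas as pd

# Hypothetical rows in the style of the recipe's columns (not the real dataset).
sample = pd.DataFrame({
    "Education": ["Graduation", "Graduation", "PhD", "Graduation"],
    "Marital_Status": ["Single", "Single", "Married", "Single"],
    "Kidhome": [0, 0, 1, 1],
})

# duplicated() flags each row that repeats an earlier row across all columns.
duplicate_mask = sample.duplicated()

# The number of flagged rows is the number drop_duplicates would remove.
print(duplicate_mask.sum())
print(len(sample.drop_duplicates()))
```

The count of flagged rows plus the rows left after drop_duplicates should add up to the original number of rows.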

There’s more...

The drop_duplicates method gives some flexibility around dropping duplicates based on a subset of columns. By supplying the list of columns as the first argument (the subset parameter), we can drop rows that are duplicates based on those columns alone. This is useful when we have several columns and only a few key columns contain duplicate information. The keep parameter controls which instances of a duplicate are retained: “first” keeps the first instance, “last” keeps the last instance, and False drops all instances. By default, the method keeps the first instance.
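The subset and keep options can be sketched as follows. The dataframe is hypothetical and only reuses the recipe's column names; the choice of Education and Marital_Status as key columns is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical rows; column names follow the recipe.
sample = pd.DataFrame({
    "Education": ["Graduation", "Graduation", "PhD", "Graduation"],
    "Marital_Status": ["Single", "Single", "Married", "Together"],
    "Kidhome": [0, 1, 1, 0],
})

# Assumed key columns: rows count as duplicates when both values match.
key_columns = ["Education", "Marital_Status"]

keep_first = sample.drop_duplicates(subset=key_columns)              # default keep="first"
keep_last = sample.drop_duplicates(subset=key_columns, keep="last")  # keep the last instance
keep_none = sample.drop_duplicates(subset=key_columns, keep=False)   # drop every instance

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(keep_none.index.tolist())
```

Rows 0 and 1 share the same key values, so the three calls differ only in which of those two rows, if any, survives.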

Dropping data rows and columns

When working with tabular data, we may need to drop some rows or columns within our dataset, typically because they are erroneous or irrelevant. In pandas, we have the flexibility to drop a single row/column or multiple rows/columns. We can use the drop method to achieve this.

Getting ready

We will work with the full Marketing Campaign data for this recipe.

How to do it…

We will drop rows and columns using the pandas library:

  1. Import the pandas library:
    import pandas as pd
  2. Load the .csv file into a dataframe using read_csv. Then, subset the dataframe to include only relevant columns:
    marketing_data = pd.read_csv("data/marketing_campaign.csv")
    marketing_data = marketing_data[['ID', 'Year_Birth', 'Education', 'Marital_Status']]
  3. Inspect the data. Check the first few rows. Check the number of columns and rows:
    marketing_data.head()
        ID    Year_Birth    Education    Marital_Status
    0    5524    1957    Graduation    Single
    1    2174    1954    Graduation    Single
    2    4141    1965    Graduation    Together
    3    6182    1984    Graduation    Together
    4    5324    1981    PhD    Married
    marketing_data.shape
    (5, 4)
  4. Delete a specified row at index value 1:
    marketing_data.drop(labels=[1], axis=0)
        ID    Year_Birth    Education    Marital_Status
    0    5524    1957    Graduation    Single
    2    4141    1965    Graduation    Together
    3    6182    1984    Graduation    Together
    4    5324    1981    PhD    Married
  5. Delete a single column:
    marketing_data.drop(labels=['Year_Birth'], axis=1)
        ID    Education    Marital_Status
    0    5524    Graduation    Single
    1    2174    Graduation    Single
    2    4141    Graduation    Together
    3    6182    Graduation    Together
    4    5324    PhD    Married

Good job! We have dropped rows and columns from our dataset.

How it works...

We refer to pandas as pd in step 1. In step 2, we use read_csv to load the .csv file into a pandas dataframe and call it marketing_data. We also subset the dataframe to include only four relevant columns. In step 3, we inspect the dataset using head() to see the first five rows in the dataset. Using the shape attribute, we get the number of rows and columns, in that order, as a tuple.

In step 4, we use the drop method to delete the row at index value 1 and view the result, which shows the row at index 1 has been removed. The drop method takes the labels to drop (here, a list of index values) through the labels parameter and an axis value through the axis parameter. The axis value determines whether the drop operation will be performed on rows or columns: 0 is used for rows, while 1 is used for columns.

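In step 5, we use the drop method to delete a specified column and view the result, which shows the column has been removed. To drop columns, we need to specify the name of the column and provide the axis value of 1. Because drop returns a new dataframe and leaves marketing_data unchanged by default, the row dropped in step 4 still appears in this output.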

There’s more...

We can drop multiple rows or columns using the drop method. To achieve this, we need to specify all the row indices or column names in a list and provide the matching axis value: 0 for rows or 1 for columns.
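Dropping several rows or columns at once might look like the sketch below. The small dataframe reuses the five rows shown in step 3 so that it runs without the .csv file; the particular rows and columns chosen for dropping are arbitrary.

```python
import pandas as pd

# The five rows shown in step 3, rebuilt by hand so the example is self-contained.
sample = pd.DataFrame({
    "ID": [5524, 2174, 4141, 6182, 5324],
    "Year_Birth": [1957, 1954, 1965, 1984, 1981],
    "Education": ["Graduation", "Graduation", "Graduation", "Graduation", "PhD"],
    "Marital_Status": ["Single", "Single", "Together", "Together", "Married"],
})

# Drop two rows by index label (axis=0).
rows_dropped = sample.drop(labels=[1, 3], axis=0)

# Drop two columns by name (axis=1).
cols_dropped = sample.drop(labels=["Year_Birth", "Marital_Status"], axis=1)

# The index= and columns= keywords are an alternative to labels/axis
# and allow rows and columns to be dropped in a single call.
both_dropped = sample.drop(index=[1, 3], columns=["Year_Birth", "Marital_Status"])

print(rows_dropped.index.tolist())
print(cols_dropped.columns.tolist())
print(both_dropped.shape)
```

Each call returns a new dataframe, so sample itself keeps all five rows and four columns unless the result is assigned back to it.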


Key benefits

  • Gain practical experience in conducting EDA on a single variable of interest in Python
  • Learn the different techniques for analyzing and exploring tabular, time series, and textual data in Python
  • Get well versed in data visualization using leading Python libraries like Matplotlib and seaborn

Description

In today's data-centric world, the ability to extract meaningful insights from vast amounts of data has become a valuable skill across industries. Exploratory Data Analysis (EDA) lies at the heart of this process, enabling us to comprehend, visualize, and derive valuable insights from various forms of data. This book is a comprehensive guide to Exploratory Data Analysis using the Python programming language. It provides the practical steps needed to effectively explore, analyze, and visualize structured and unstructured data. It offers hands-on guidance and code for concepts such as generating summary statistics, analyzing single and multiple variables, visualizing data, analyzing text data, handling outliers, handling missing values, and automating the EDA process. It is suited for data scientists, data analysts, researchers, or curious learners looking to gain essential knowledge and practical steps for analyzing vast amounts of data to uncover insights. Python is an open-source, general-purpose programming language that is widely used for data science and data analysis given its simplicity and versatility. It offers several libraries that can be used to clean, analyze, and visualize data. In this book, we will explore popular Python libraries such as pandas, Matplotlib, and seaborn and provide workable code for analyzing data in Python using these libraries. By the end of this book, you will have gained comprehensive knowledge about EDA and mastered the powerful set of EDA techniques and tools required for analyzing both structured and unstructured data to derive valuable insights.

Who is this book for?

Whether you are a data analyst, data scientist, researcher, or a curious learner looking to analyze structured and unstructured data, this book will appeal to you. It aims to empower you with essential knowledge and practical skills for analyzing and visualizing data to uncover insights. It covers several EDA concepts and provides hands-on instructions on how these can be applied using various Python libraries. Familiarity with basic statistical concepts and foundational knowledge of Python programming will help you understand the content better and maximize your learning experience.

What you will learn

  • Perform EDA with leading Python data visualization libraries
  • Execute univariate, bivariate, and multivariate analysis on tabular data
  • Uncover patterns and relationships within time series data
  • Identify hidden patterns within textual data
  • Learn different techniques to prepare data for analysis
  • Overcome the challenge of outliers and missing values during data analysis
  • Leverage automated EDA for fast and efficient analysis

Product Details

Publication date: Jun 30, 2023
Length: 382 pages
Edition: 1st
Language: English
ISBN-13: 9781803246130



Table of Contents

12 Chapters
Chapter 1: Generating Summary Statistics
Chapter 2: Preparing Data for EDA
Chapter 3: Visualizing Data in Python
Chapter 4: Performing Univariate Analysis in Python
Chapter 5: Performing Bivariate Analysis in Python
Chapter 6: Performing Multivariate Analysis in Python
Chapter 7: Analyzing Time Series Data in Python
Chapter 8: Analysing Text Data in Python
Chapter 9: Dealing with Outliers and Missing Values
Chapter 10: Performing Automated Exploratory Data Analysis in Python
Index
Other Books You May Enjoy

Customer reviews

Rating distribution
4.8 out of 5 (5 ratings)
5 star 80%
4 star 20%
3 star 0%
2 star 0%
1 star 0%
Ram Seshadri Sep 05, 2023
5 out of 5
EDA is a very difficult step in machine learning process. Many newbies find this the most challenging part of their ML journey. This book makes that somewhat easy by providing step by step instructions on how to perform various steps in EDA by taking different kinds of data and performing EDA on them. Here are the highlights that I found useful: 1. Performing univariate analysis is the first step when performing EDA. This book provides 6 charts to use to analyze data in this section. 2. In bivariate and multivariate analysis there are over 15 methods discussed. 3. Next there are sections on analyzing time series data and how to analyze Text variables for NLP use cases. These are not usually discussed in many EDA books. 4. Finally the book handles missing values and outliers along with an overview of auto EDA tools. In summary I found the book comprehensive and a quick way to improve your EDA skills. There are over 50 recipes discussed and I think you will find many of them useful. All in all I highly recommend this book.
Amazon Verified review
Om S Aug 11, 2023
5 out of 5
Navigating through the pages of the "Exploratory Data Analysis with Python Cookbook" feels like following a guiding beacon into the realm of unraveling hidden insights within data. This resourceful book introduces Python's capabilities in single-variable EDA, providing techniques to analyze tabular, time series, and textual data with tools like Matplotlib and Seaborn. From crafting visual narratives to adeptly handling outliers and missing values, it serves as a comprehensive companion for data analysts, scientists, and curious learners alike. Its pragmatic approach makes it an invaluable asset for those new to the field as well as seasoned practitioners.
Amazon Verified review
Dr. Chu Meh Chu Oct 13, 2023
5 out of 5
I have been looking for an introductory book on exploratory data analysis for a while. This book, more than answered and satisfied my curiosity. It is a truly a cookbook and will give the reader clear solutions to their data driven problems. I found the organization of the book very easy to follow which correspondingly boosted my confidence in my ability to work on EDA (Exploratory data analysis). I highly recommend this book for its level of detail, readability, and understanding.
Amazon Verified review
Mojeed Abisiga Jul 20, 2023
5 out of 5
This book excels in every category, ranking in the top 2 to 3% of all the data analytics books I've ever read or come across. And yes my favorite part of the book was where he used pyLDAvis module to plot topics and top words of a reviews data, it reminded me of an interesting project I worked on that I used those same set of visualizations.
Amazon Verified review
Taylor B. Oct 30, 2023
4 out of 5
I’m starting to learn python and have taken a couple machine learning classes. I bought this book to brush up on my knowledge a bit more. It is easy to follow, and you can easily recreate a lot of the recipes in this book. The lessons build on one another, so that you learn how to do combinations of things. I wish there were some more advanced examples or assignments. Like, this is how to do joining. Now here’s another more extreme version of it, or something like that. I feel that the single examples don’t provide a strong knowledge of the concepts. But if you are just learning these for the first time, it’s a great book.
Amazon Verified review
