Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Exploratory Data Analysis with Python Cookbook

You're reading from   Exploratory Data Analysis with Python Cookbook Over 50 recipes to analyze, visualize, and extract insights from structured and unstructured data

Arrow left icon
Product type Paperback
Published in Jun 2023
Publisher Packt
ISBN-13 9781803231105
Length 382 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Ayodele Oluleye Ayodele Oluleye
Author Profile Icon Ayodele Oluleye
Ayodele Oluleye
Arrow right icon
View More author details
Toc

Table of Contents (13) Chapters Close

Preface 1. Chapter 1: Generating Summary Statistics 2. Chapter 2: Preparing Data for EDA FREE CHAPTER 3. Chapter 3: Visualizing Data in Python 4. Chapter 4: Performing Univariate Analysis in Python 5. Chapter 5: Performing Bivariate Analysis in Python 6. Chapter 6: Performing Multivariate Analysis in Python 7. Chapter 7: Analyzing Time Series Data in Python 8. Chapter 8: Analysing Text Data in Python 9. Chapter 9: Dealing with Outliers and Missing Values 10. Chapter 10: Performing Automated Exploratory Data Analysis in Python 11. Index 12. Other Books You May Enjoy

Dealing with missing values

Dealing with missing values is a common problem we will typically face when analyzing data. A missing value is a value within a field or variable that is not present, even though it is expected to be. There are several reasons why this could have happened, but a common reason is that the data value wasn’t provided at the point of data collection. As we explore and analyze data, missing values can easily lead to inaccurate or biased conclusions; therefore, they need to be taken care of. Missing values are typically represented by blank spaces, but in pandas, they are represented by NaN.

Several techniques can be used to deal with missing values. In this recipe, we will focus on dropping missing values using the dropna method in pandas.

Getting ready

We will work with the full Marketing Campaign data in this recipe.

How to do it…

We will drop rows and columns using the pandas library:

  1. Import the pandas library:
    import pandas as pd
  2. Load the .csv into a dataframe using read_csv. Then, subset the dataframe to include only relevant columns:
    marketing_data = pd.read_csv("data/marketing_campaign.csv")
    marketing_data = marketing_data[['ID', 'Year_Birth', 'Education','Income']]
  3. Inspect the data. Check the first few rows, and check the number of columns and rows:
    marketing_data.head()
        ID    Year_Birth        Education    Income
    0    5524    1957            Graduation    58138.0
    1    2174    1954            Graduation    46344.0
    2    4141    1965            Graduation    71613.0
    3    6182    1984            Graduation    26646.0
    4    5324    1981            PhD         58293.0
    marketing_data.shape
    (2240, 4)
  4. Check for missing values using the isnull and sum methods:
    marketing_data.isnull().sum()
    ID             0
    Year_Birth     0
    Education      0
    Income        24
  5. Drop missing values using the dropna method:
    marketing_data_withoutna = marketing_data.dropna(how = 'any')
    marketing_data_withoutna.shape
    (2216, 4)

Good job! We have dropped missing values from our dataset.

How it works...

In step 1, we import pandas and refer to it as pd. In step 2, we use read_csv to load the .csv file into a pandas dataframe and call it marketing_data. We also subset the dataframe to include only four relevant columns. In step 3, we inspect the dataset using the head and shape methods.

In step 4, we use the isnull and sum methods to check for missing values. These methods give us the number of rows with missing values within each column in our dataset. Columns with zero have no rows with missing values.

In step 5, we use the dropna method to drop missing values. For the how parameter, we supply 'any' as the value to indicate that we want to drop rows that have missing values in any of the columns. An alternative value to use is 'all', which ensures all the columns of a row have missing values before dropping the row. We then check the number of rows and columns using the shape method. We can see that the final dataset has 24 fewer rows.

There’s more...

As highlighted previously, there are several reasons why a value may be missing in our dataset. Understanding the reason can point us to optimal solutions to resolve this problem. Missing values shouldn’t be addressed with a one-size-fits-all approach. Chapter 9 dives into the details of how to optimally deal with missing values and outliers by providing several techniques.

See also

You can check out a detailed approach to dealing with missing values and outliers in Chapter 9, Dealing with Outliers and Missing Values.

You have been reading a chapter from
Exploratory Data Analysis with Python Cookbook
Published in: Jun 2023
Publisher: Packt
ISBN-13: 9781803231105
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime