You're reading from Exploratory Data Analysis with Python Cookbook Over 50 recipes to analyze, visualize, and extract insights from structured and unstructured data

Product type Paperback

Published in Jun 2023

Publisher Packt

ISBN-13 9781803231105

Length 382 pages

Edition 1st Edition

Languages

Python

Tools

Matplotlib

Concepts

Data Analysis

Author (1):

Ayodele Oluleye

View More author details

Table of Contents (13) Chapters

Preface

1. Chapter 1: Generating Summary Statistics

2. Chapter 2: Preparing Data for EDA FREE CHAPTER

3. Chapter 3: Visualizing Data in Python

4. Chapter 4: Performing Univariate Analysis in Python

5. Chapter 5: Performing Bivariate Analysis in Python

6. Chapter 6: Performing Multivariate Analysis in Python

7. Chapter 7: Analyzing Time Series Data in Python

8. Chapter 8: Analysing Text Data in Python

9. Chapter 9: Dealing with Outliers and Missing Values

10. Chapter 10: Performing Automated Exploratory Data Analysis in Python

11. Index

Why subscribe?

12. Other Books You May Enjoy

Dealing with missing values

Dealing with missing values is a common problem we will typically face when analyzing data. A missing value is a value within a field or variable that is not present, even though it is expected to be. There are several reasons why this could have happened, but a common reason is that the data value wasn’t provided at the point of data collection. As we explore and analyze data, missing values can easily lead to inaccurate or biased conclusions; therefore, they need to be taken care of. Missing values are typically represented by blank spaces, but in pandas, they are represented by NaN.

Several techniques can be used to deal with missing values. In this recipe, we will focus on dropping missing values using the dropna method in pandas.

Getting ready

We will work with the full Marketing Campaign data in this recipe.

How to do it…

We will drop rows and columns using the pandas library:

Import the pandas library:
```
import pandas as pd
```

Load the .csv into a dataframe using read_csv. Then, subset the dataframe to include only relevant columns:

marketing_data = pd.read_csv("data/marketing_campaign.csv")

marketing_data = marketing_data[['ID', 'Year_Birth', 'Education','Income']]

Inspect the data. Check the first few rows, and check the number of columns and rows:

marketing_data.head()

    ID    Year_Birth        Education    Income

0    5524    1957            Graduation    58138.0

1    2174    1954            Graduation    46344.0

2    4141    1965            Graduation    71613.0

3    6182    1984            Graduation    26646.0

4    5324    1981            PhD         58293.0

marketing_data.shape

(2240, 4)

Check for missing values using the isnull and sum methods:

marketing_data.isnull().sum()

ID             0

Year_Birth     0

Education      0

Income        24

Drop missing values using the dropna method:

marketing_data_withoutna = marketing_data.dropna(how = 'any')

marketing_data_withoutna.shape

(2216, 4)

Good job! We have dropped missing values from our dataset.

How it works...

In step 1, we import pandas and refer to it as pd. In step 2, we use read_csv to load the .csv file into a pandas dataframe and call it marketing_data. We also subset the dataframe to include only four relevant columns. In step 3, we inspect the dataset using the head and shape methods.

In step 4, we use the isnull and sum methods to check for missing values. These methods give us the number of rows with missing values within each column in our dataset. Columns with zero have no rows with missing values.

In step 5, we use the dropna method to drop missing values. For the how parameter, we supply 'any' as the value to indicate that we want to drop rows that have missing values in any of the columns. An alternative value to use is 'all', which ensures all the columns of a row have missing values before dropping the row. We then check the number of rows and columns using the shape method. We can see that the final dataset has 24 fewer rows.

There’s more...

As highlighted previously, there are several reasons why a value may be missing in our dataset. Understanding the reason can point us to optimal solutions to resolve this problem. Missing values shouldn’t be addressed with a one-size-fits-all approach. Chapter 9 dives into the details of how to optimally deal with missing values and outliers by providing several techniques.