Dealing with missing values
Dealing with missing values is a common problem we will typically face when analyzing data. A missing value is a value within a field or variable that is not present, even though it is expected to be. There are several reasons why this could have happened, but a common reason is that the data value wasn’t provided at the point of data collection. As we explore and analyze data, missing values can easily lead to inaccurate or biased conclusions; therefore, they need to be taken care of. Missing values are typically represented by blank spaces, but in pandas
, they are represented by NaN.
Several techniques can be used to deal with missing values. In this recipe, we will focus on dropping missing values using the dropna
method in pandas
.
Getting ready
We will work with the full Marketing Campaign data in this recipe.
How to do it…
We will drop rows and columns using the pandas
library:
- Import the
pandas
library:import pandas as pd
- Load the
.csv
into a dataframe usingread_csv
. Then, subset the dataframe to include only relevant columns:marketing_data = pd.read_csv("data/marketing_campaign.csv")
marketing_data = marketing_data[['ID', 'Year_Birth', 'Education','Income']]
- Inspect the data. Check the first few rows, and check the number of columns and rows:
marketing_data.head()
ID Year_Birth Education Income
0 5524 1957 Graduation 58138.0
1 2174 1954 Graduation 46344.0
2 4141 1965 Graduation 71613.0
3 6182 1984 Graduation 26646.0
4 5324 1981 PhD 58293.0
marketing_data.shape
(2240, 4)
- Check for missing values using the
isnull
andsum
methods:marketing_data.isnull().sum()
ID 0
Year_Birth 0
Education 0
Income 24
- Drop missing values using the
dropna
method:marketing_data_withoutna = marketing_data.dropna(how = 'any')
marketing_data_withoutna.shape
(2216, 4)
Good job! We have dropped missing values from our dataset.
How it works...
In step 1, we import pandas
and refer to it as pd
. In step 2, we use read_csv
to load the .csv
file into a pandas
dataframe and call it marketing_data
. We also subset the dataframe to include only four relevant columns. In step 3, we inspect the dataset using the head
and shape
methods.
In step 4, we use the isnull
and sum
methods to check for missing values. These methods give us the number of rows with missing values within each column in our dataset. Columns with zero have no rows with missing values.
In step 5, we use the dropna
method to drop missing values. For the how
parameter, we supply 'any'
as the value to indicate that we want to drop rows that have missing values in any of the columns. An alternative value to use is 'all'
, which ensures all the columns of a row have missing values before dropping the row. We then check the number of rows and columns using the shape
method. We can see that the final dataset has 24 fewer rows.
There’s more...
As highlighted previously, there are several reasons why a value may be missing in our dataset. Understanding the reason can point us to optimal solutions to resolve this problem. Missing values shouldn’t be addressed with a one-size-fits-all approach. Chapter 9 dives into the details of how to optimally deal with missing values and outliers by providing several techniques.
See also
You can check out a detailed approach to dealing with missing values and outliers in Chapter 9, Dealing with Outliers and Missing Values.