You're reading from Python Data Cleaning Cookbook Prepare your data for analysis with pandas, NumPy, Matplotlib, scikit-learn, and OpenAI

Product type Paperback

Published in May 2024

Publisher Packt

ISBN-13 9781803239873

Length 486 pages

Edition 2nd Edition

Languages

Python

Tools

Matplotlib

Concepts

Data Analysis

Author (1):

Michael Walker

View More author details

Table of Contents (14) Chapters

Preface

1. Anticipating Data Cleaning Issues When Importing Tabular Data with pandas FREE CHAPTER

2. Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data

3. Taking the Measure of Your Data

4. Identifying Outliers in Subsets of Data

5. Using Visualizations for the Identification of Unexpected Values

6. Cleaning and Exploring Data with Series Operations

7. Identifying and Fixing Missing Values

8. Encoding, Transforming, and Scaling Features

9. Fixing Messy Data When Aggregating

10. Addressing Data Issues When Combining DataFrames

11. Tidying and Reshaping Data

12. Automate Data Cleaning with User-Defined Functions, Classes, and Pipelines

13. Index

Importing CSV files

The read_csv method of the pandas library can be used to read a file with comma separated values (CSV) and load it into memory as a pandas DataFrame. In this recipe, we import a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical missing data.

Raw data is often stored as CSV files. These files have a carriage return at the end of each line of data to demarcate a row, and a comma between each data value to delineate columns. Something other than a comma can be used as the delimiter, such as a tab. Quotation marks may be placed around values, which can be helpful when the delimiter occurs naturally within certain values, which sometimes happens with commas.

All data in a CSV file are characters, regardless of the logical data type. This is why it is easy to view a CSV file, presuming it is not too large, in a text editor. The pandas read_csv method will make an educated guess about the data type of each column, but you will need to help it along to ensure that these guesses are on the mark.

Getting ready

Create a folder for this chapter, and then create a new Python script or Jupyter Notebook file in that folder. Create a data subfolder, and then place the landtempssample.csv file in that subfolder. Alternatively, you could retrieve all of the files from the GitHub repository, including the data files. Here is a screenshot of the beginning of the CSV file:

Figure 1.1: Land Temperatures Data

Data note

This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly. I used the data from version 4. The data in this recipe uses a 100,000-row sample of the full dataset, which is also available in the repository.

How to do it…

We will import a CSV file into pandas, taking advantage of some very useful read_csv options:

Import the pandas library, and set up the environment to make viewing the output easier:

import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.width', 85)
pd.set_option('display.max_columns', 8)

Read the data file, set new names for the headings, and parse the date column.

Pass an argument of 1 to the skiprows parameter to skip the first row, pass a list of columns to parse_dates to create a pandas datetime column from those columns, and set low_memory to False. This will cause pandas to load all of the data into memory at once, rather than in chunks. We do this so that pandas can identify the data type of each column automatically. In the There’s more… section, we see how to set the data type for each column manually:

landtemps = pd.read_csv('data/landtempssample.csv',
...     names=['stationid','year','month','avgtemp','latitude',
...       'longitude','elevation','station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']],
...     low_memory=False)
type(landtemps)

<class 'pandas.core.frame.DataFrame'>

Note

We have to use skiprows because we are passing a list of column names to read_csv. If we use the column names in the CSV file, we do not need to specify values for either names or skiprows.

Get a quick glimpse of the data.

View the first few rows. Show the data type for all columns, as well as the number of rows and columns:

landtemps.head(7)

  month_year    stationid  ...  countryid              country
0 2000-04-01  USS0010K01S  ...         US        United States
1 1940-05-01  CI000085406  ...         CI                Chile
2 2013-12-01  USC00036376  ...         US        United States
3 1963-02-01  ASN00024002  ...         AS            Australia
4 2001-11-01  ASN00028007  ...         AS            Australia
5 1991-04-01  USW00024151  ...         US        United States
6 1993-12-01  RSM00022641  ...         RS               Russia
[7 rows x 9 columns]

landtemps.dtypes

month_year       datetime64[ns]
stationed        object
avgtemp          float64
latitude         float64
longitude        float64
elevation        float64
station          object
countryid        object
country          object
dtype: object

landtemps.shape

(100000, 9)

Give the date column a more appropriate name and view the summary statistics for average monthly temperature:

landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
landtemps.dtypes

measuredate      datetime64[ns]
stationid        object
avgtemp          float64
latitude         float64
longitude        float64
elevation        float64
station          object
countryid        object
country          object
dtype:           object

landtemps.avgtemp.describe()

count   85,554.00
mean    10.92
std     11.52
min     -70.70
25%     3.46
50%     12.22
75%     19.57
max     39.95
Name: avgtemp, dtype: float64

Look for missing values for each column.

Use isnull, which returns True for each value that is missing for each column, and False when not missing. Chain this with sum to count the missing values for each column. (When working with Boolean values, sum treats True as 1 and False as 0. I will discuss method chaining in the There’s more... section of this recipe):

landtemps.isnull().sum()

measuredate    0
stationed      0
avgtemp        14446
latitude       0
longitude      0
elevation      0
station        0
countryid      0
country        5
dtype: int64

Remove rows with missing data for avgtemp.

Use the subset parameter to tell dropna to drop rows when avgtemp is missing. Set inplace to True. Leaving inplace at its default value of False would display the DataFrame, but the changes we have made would not be retained. Use the shape attribute of the DataFrame to get the number of rows and columns:

landtemps.dropna(subset=['avgtemp'], inplace=True)
landtemps.shape

(85554, 9)

That’s it! Importing CSV files into pandas is as simple as that.

How it works...

Almost all of the recipes in this book use the pandas library. We refer to it as pd to make it easier to reference later. This is customary. We also use float_format to display float values in a readable way and set_option to make the Terminal output wide enough to accommodate the number of variables.

Much of the work is done by the first line in Step 2. We use read_csv to load a pandas DataFrame in memory and call it landtemps. In addition to passing a filename, we set the names parameter to a list of our preferred column headings. We also tell read_csv to skip the first row, by setting skiprows to 1, since the original column headings are in the first row of the CSV file. If we do not tell it to skip the first row, read_csv will treat the header row in the file as actual data.

read_csv also solves a date conversion issue for us. We use the parse_dates parameter to ask it to convert the month and year columns to a date value.

Step 3 runs through a few standard data checks. We use head(7) to print out all columns for the first seven rows. We use the dtypes attribute of the DataFrame to show the data type of all columns. Each column has the expected data type. In pandas, character data has the object data type, a data type that allows for mixed values. shape returns a tuple, whose first element is the number of rows in the DataFrame (100,000 in this case) and whose second element is the number of columns (9).

When we used read_csv to parse the month and year columns, it gave the resulting column the name month_year. We used the rename method in Step 4 to give that column a more appropriate name. We need to specify inplace=True to replace the old column name with the new column name in memory. The describe method provides summary statistics on the avgtemp column.

Notice that the count for avgtemp indicates that there are 85,554 rows that have valid values for avgtemp. This is out of 100,000 rows for the whole DataFrame, as provided by the shape attribute. The listing of missing values for each column in Step 5 (landtemps.isnull().sum()) confirms this: 100,000 – 85,554 = 14,446.

Step 6 drops all rows where avgtemp is NaN. (The NaN value, not a number, is the pandas representation of missing values.) subset is used to indicate which column to check for missing values. The shape attribute for landtemps now indicates that there are 85,554 rows, which is what we would expect, given the previous count from describe.

There’s more...

If the file you are reading uses a delimiter other than a comma, such as a tab, this can be specified in the sep parameter of read_csv. When creating the pandas DataFrame, an index was also created. The numbers to the far left of the output when head was run are index values. Any number of rows can be specified for head. The default value is 5.

Instead of setting low_memory to False, to get pandas to make good guesses regarding data types, we could have set data types manually:

landtemps = pd.read_csv('data/landtempssample.csv',
    names=['stationid','year','month','avgtemp','latitude',
      'longitude','elevation','station','countryid','country'],
    skiprows=1,
    parse_dates=[['month','year']],
    dtype={'stationid':'object', 'avgtemp':'float64',
     'latitude':'float64','longitude':'float64',
     'elevation':'float64','station':'object',
     'countryid':'object','country':'object'},
    )
landtemps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype       
---  ------      --------------   -----       
 0   month_year  100000 non-null  datetime64[ns]
 1   stationid   100000 non-null  object      
 2   avgtemp     85554 non-null   float64     
 3   latitude    100000 non-null  float64     
 4   longitude   100000 non-null  float64     
 5   elevation   100000 non-null  float64     
 6   station     100000 non-null  object      
 7   countryid   100000 non-null  object      
 8   country     99995 non-null   object      
dtypes: datetime64[ns](1), float64(4), object(4)
memory usage: 6.9+ MB

The landtemps.isnull().sum() statement is an example of chaining methods. First, isnull returns a DataFrame of True and False values, resulting from testing whether each column value is null. The sum function takes that DataFrame and sums the True values for each column, interpreting the True values as 1 and the False values as 0. We would have obtained the same result if we had used the following two steps:

checknull = landtemps.isnull()
checknull.sum()

There is no hard and fast rule for when to chain methods and when not to do so. I find chaining helpful when the overall operation feels like a single step, even if it’s two or more steps mechanically. Chaining also has the side benefit of not creating extra objects that I might not need.

The dataset used in this recipe is just a sample from the full land temperatures database, with almost 17 million records. You can run the larger file if your machine can handle it, with the following code:

landtemps = pd.read_csv('data/landtemps.zip',
...   compression='zip', names=['stationid','year',
...     'month','avgtemp','latitude','longitude',
...     'elevation','station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']],
...     low_memory=False)