Persisting tabular data
We persist data, copy it from memory to local or remote storage, for several reasons: to be able to access the data without having to repeat the steps we used to generate it, to share the data with others, or to make it available for use with different software. In this recipe, we save data that we have loaded into a pandas DataFrame as different file types (CSV, Excel, Pickle, and Feather).
Another important, but sometimes overlooked, reason to persist data is to preserve some segment of our data that needs to be examined more closely; perhaps it needs to be scrutinized by others before our analysis can be completed. For analysts who work with operational data in medium- to large-sized organizations, this process is part of the daily data-cleaning workflow.
In addition to these reasons for persisting data, our decisions about when and how to serialize data are shaped by several other factors: where we are in terms of our data analysis projects, the hardware and software resources of the machine(s) saving and reloading the data, and the size of our dataset. Analysts end up having to be much more intentional when saving data than they are when pressing Ctrl + S in their word-processing application.
Once we persist data, it is stored separately from the logic that we used to create it. I find this separation to be one of the most important threats to the integrity of our analysis. Often, we end up loading data that we saved some time in the past (a week ago? a month ago? a year ago?) and forget how a variable was defined and how it relates to other variables. If we are in the middle of a data-cleaning task, it is best not to persist our data, so long as our workstation and network can easily handle the burden of regenerating it. It is a good idea to persist data only once we have reached milestones in our work.
Beyond the question of when to persist data, there is the question of how. If we are persisting it for our own reuse with the same software, it is best to save it in a binary format native to that software. That is pretty straightforward for tools such as SPSS, SAS, Stata, and R, but not so much for pandas. But that is good news in a way. We have lots of choices, from CSV and Excel to Pickle and Feather. We save our data in all of these formats in this recipe.
Note
Pickle and Feather are binary file formats that can be used to store pandas DataFrames.
Getting ready
You will need to install Feather if you do not have it on your system. You can do that by entering pip install pyarrow in a Terminal window or PowerShell (in Windows). If you do not already have a subfolder named views in your chapter 1 folder, you will need to create it in order to run the code for this recipe.
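If you would rather create the folder from code than by hand, a minimal sketch (assuming the script runs from your chapter 1 folder) is:

import os

# create the views subfolder; do nothing if it already exists
os.makedirs('views', exist_ok=True)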
Data note
This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly. I used the data from version 4. This recipe works with a 100,000-row sample of the full dataset, which is also available in the repository.
How to do it…
We will load a CSV file into pandas and then save it as a Pickle and a Feather file. We will also save subsets of the data to the CSV and Excel formats:
- Import pandas and pyarrow. pyarrow needs to be imported in order to save pandas to Feather:
import pandas as pd
import pyarrow
- Load the land temperatures CSV file into pandas, drop rows with missing data, and set an index:
landtemps = pd.read_csv('data/landtempssample.csv',
    names=['stationid','year','month','avgtemp',
      'latitude','longitude','elevation',
      'station','countryid','country'],
    skiprows=1,
    parse_dates=[['month','year']],
    low_memory=False)
landtemps.rename(columns={'month_year':'measuredate'},
    inplace=True)
landtemps.dropna(subset=['avgtemp'], inplace=True)
landtemps.dtypes
measuredate    datetime64[ns]
stationid              object
avgtemp               float64
latitude              float64
longitude             float64
elevation             float64
station                object
countryid              object
country                object
dtype: object
landtemps.set_index(['measuredate','stationid'], inplace=True)
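A side note on the load step: passing the nested list [['month','year']] to parse_dates tells read_csv to combine those two columns into a single month_year datetime column, which we then rename to measuredate. Newer pandas versions deprecate this nested-list form; a sketch of an alternative (my workaround, not the book's code) that builds the datetime after loading:

landtemps = pd.read_csv('data/landtempssample.csv',
    names=['stationid','year','month','avgtemp',
      'latitude','longitude','elevation',
      'station','countryid','country'],
    skiprows=1, low_memory=False)

# assemble the datetime from the year and month columns; day 1 is assumed
landtemps['measuredate'] = pd.to_datetime(
    landtemps[['year','month']].assign(day=1))
landtemps = landtemps.drop(columns=['year','month'])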
- Write extreme values for temperature to CSV and Excel files. Use the quantile method to select outlier rows, which are those at the 1-in-1,000 level at each end of the distribution:
extremevals = landtemps[(landtemps.avgtemp <
    landtemps.avgtemp.quantile(.001)) |
    (landtemps.avgtemp >
    landtemps.avgtemp.quantile(.999))]
extremevals.shape
(171, 7)
extremevals.sample(7)
avgtemp ... country
measuredate stationid ...
2013-08-01 QAM00041170 35.30 ... Qatar
2005-01-01 RSM00024966 -40.09 ... Russia
1973-03-01 CA002401200 -40.26 ... Canada
2007-06-01 KU000405820 37.35 ... Kuwait
1987-07-01 SUM00062700 35.50 ... Sudan
1998-02-01 RSM00025325 -35.71 ... Russia
1968-12-01 RSM00024329 -43.20 ... Russia
[7 rows x 7 columns]
extremevals.to_excel('views/tempext.xlsx')
extremevals.to_csv('views/tempext.csv')
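If you would rather put the cold and hot extremes on separate worksheets, one option is pandas' ExcelWriter. A sketch, not part of the recipe; the filename is my own, and it assumes an Excel engine such as openpyxl is installed:

with pd.ExcelWriter('views/tempext_tails.xlsx') as writer:
    # reuse the same quantile cut-offs to split the two tails
    extremevals[extremevals.avgtemp <
      landtemps.avgtemp.quantile(.001)].to_excel(writer, sheet_name='cold')
    extremevals[extremevals.avgtemp >
      landtemps.avgtemp.quantile(.999)].to_excel(writer, sheet_name='hot')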
- Save to Pickle and Feather files.
The index needs to be reset in order to save a Feather file:
landtemps.to_pickle('data/landtemps.pkl')
landtemps.reset_index(inplace=True)
landtemps.to_feather("data/landtemps.ftr")
- Load the Pickle and Feather files we just saved.
Note that our index was preserved when saving and loading the Pickle file:
landtemps = pd.read_pickle('data/landtemps.pkl')
landtemps.head(2).T
measuredate 2000-04-01 1940-05-01
stationid USS0010K01S CI000085406
avgtemp 5.27 18.04
latitude 39.90 -18.35
longitude -110.75 -70.33
elevation 2,773.70 58.00
station INDIAN_CANYON ARICA
countryid US CI
country United States Chile
landtemps = pd.read_feather("data/landtemps.ftr")
landtemps.head(2).T
0 1
measuredate 2000-04-01 00:00:00 1940-05-01 00:00:00
stationid USS0010K01S CI000085406
avgtemp 5.27 18.04
latitude 39.90 -18.35
longitude -110.75 -70.33
elevation 2,773.70 58.00
station INDIAN_CANYON ARICA
countryid US CI
country United States Chile
The previous steps demonstrated how to serialize pandas DataFrames using two different formats, Pickle and Feather.
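A quick sanity check, which is my addition rather than part of the recipe: the two round trips should return the same data once we rebuild the index that the Feather file could not store:

from pandas.testing import assert_frame_equal

pickled = pd.read_pickle('data/landtemps.pkl')
feathered = pd.read_feather('data/landtemps.ftr')

# the pickle kept the MultiIndex; restore it on the Feather copy before comparing
assert_frame_equal(pickled,
    feathered.set_index(['measuredate','stationid']))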
How it works...
Persisting pandas data is quite straightforward. DataFrames have to_csv, to_excel, to_pickle, and to_feather methods. Pickling preserves our index.
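Each write method has a matching read counterpart, which we used in step 5 for Pickle and Feather. Assuming the files written earlier in this recipe (and, for Excel, an engine such as openpyxl):

extremevals = pd.read_csv('views/tempext.csv')    # plain text; dtypes are re-inferred
extremevals = pd.read_excel('views/tempext.xlsx')
landtemps = pd.read_pickle('data/landtemps.pkl')  # index and dtypes preserved
landtemps = pd.read_feather('data/landtemps.ftr')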
There’s more...
The advantage of storing data in CSV files is that saving it uses up very little additional memory. The disadvantage is that writing CSV files is quite slow, and we lose important metadata, such as data types. (read_csv can often figure out the data types when we reload the file, but not always.) Pickle files keep that metadata, but serializing can burden a system that is low on resources. Feather is easier on resources, and can be easily loaded in R as well as Python, but we have to sacrifice our index in order to serialize. Also, the authors of Feather make no promises regarding long-term support.
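A small sketch that illustrates the metadata point (the demo file paths are my own): a datetime column survives a Pickle round trip untouched, but comes back from CSV as object unless we re-parse it:

demo = pd.DataFrame({'measured': pd.to_datetime(['2000-04-01']),
    'avgtemp': [5.27]})

demo.to_csv('views/demo.csv', index=False)
pd.read_csv('views/demo.csv').dtypes
# measured is now object; re-parsing recovers it
pd.read_csv('views/demo.csv', parse_dates=['measured']).dtypes

demo.to_pickle('views/demo.pkl')
pd.read_pickle('views/demo.pkl').dtypes
# measured is still datetime64[ns], with no extra work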
You may have noticed that I do not make a global recommendation about what to use for data serialization – other than to limit your persistence of full datasets to project milestones. This is definitely one of those “right tools for the right job” kind of situations. I use CSV or Excel files when I want to share a segment of a file with colleagues for discussion. I use Feather for ongoing Python projects, particularly when I am using a machine with sub-par RAM and an outdated chip and also using R. When I am wrapping up a project, I pickle the DataFrames.