Python Data Cleaning Cookbook


Persisting tabular data

We persist data (copy it from memory to local or remote storage) for several reasons: to be able to access the data without having to repeat the steps we used to generate it, to share the data with others, or to make it available for use with different software. In this recipe, we save data that we have loaded into a pandas DataFrame as several different file types (CSV, Excel, Pickle, and Feather).

Another important, but sometimes overlooked, reason to persist data is to preserve some segment of our data that needs to be examined more closely; perhaps it needs to be scrutinized by others before our analysis can be completed. For analysts who work with operational data in medium- to large-sized organizations, this process is part of the daily data-cleaning workflow.

In addition to these reasons for persisting data, our decisions about when and how to serialize data are shaped by several other factors: where we are in our data analysis project, the hardware and software resources of the machine(s) saving and reloading the data, and the size of our dataset. Analysts end up having to be much more intentional when saving data than they are when pressing Ctrl + S in their word-processing application.

Once we persist data, it is stored separately from the logic that we used to create it. I find this to be one of the most important threats to the integrity of our analysis. Often, we end up loading data that we saved some time in the past (a week ago? a month ago? a year ago?) and forget how a variable was defined and how it relates to other variables. If we are in the middle of a data-cleaning task, it is best not to persist our data, so long as our workstation and network can easily handle the burden of regenerating the data. It is a good idea to persist data only once we have reached milestones in our work.

Beyond the question of when to persist data, there is the question of how. If we are persisting data for our own reuse with the same software, it is best to save it in a binary format native to that software. That is pretty straightforward for tools such as SPSS, SAS, Stata, and R, but not so much for pandas. But that is good news in a way. We have lots of choices, from CSV and Excel to Pickle and Feather. We save to all of these file types in this recipe.

Note

Pickle and Feather are binary file formats that can be used to store pandas DataFrames.

Getting ready

You will need to install pyarrow, which provides Feather support, if you do not already have it on your system. You can do that by entering pip install pyarrow in a Terminal window or in PowerShell (on Windows). If you do not already have a subfolder named views in your Chapter 1 folder, you will need to create it in order to run the code for this recipe; a sketch for doing this from Python follows.
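
If you prefer to create the folder from code rather than by hand, here is a minimal sketch (assuming your working directory is the Chapter 1 folder):

import os

# create the views subfolder if it does not already exist;
# exist_ok prevents an error when the folder is already there
os.makedirs('views', exist_ok=True)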

Data note

This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly. I used the data from version 4. The data in this recipe uses a 100,000-row sample of the full dataset, which is also available in the repository.

How to do it…

We will load a CSV file into pandas and then save it as a Pickle and a Feather file. We will also save subsets of the data to the CSV and Excel formats:

  1. Import pandas and pyarrow.

pyarrow needs to be imported in order to save pandas to Feather:

import pandas as pd
import pyarrow
  2. Load the land temperatures CSV file into pandas, drop rows with missing data, and set an index:

landtemps = pd.read_csv('data/landtempssample.csv',
    names=['stationid','year','month','avgtemp',
        'latitude','longitude','elevation',
        'station','countryid','country'],
    skiprows=1,
    parse_dates=[['month','year']],
    low_memory=False)
landtemps.rename(columns={'month_year':'measuredate'},
    inplace=True)
landtemps.dropna(subset=['avgtemp'], inplace=True)
landtemps.dtypes

measuredate    datetime64[ns]
stationid      object
avgtemp        float64
latitude       float64
longitude      float64
elevation      float64
station        object
countryid      object
country        object
dtype: object

landtemps.set_index(['measuredate','stationid'], inplace=True)
    
  3. Write extreme values for temperature to CSV and Excel files.

Use the quantile method to select outlier rows, which are those at the 1 in 1,000 level at each end of the distribution:

extremevals = landtemps[(landtemps.avgtemp < landtemps.avgtemp.quantile(.001)) |
    (landtemps.avgtemp > landtemps.avgtemp.quantile(.999))]
extremevals.shape
(171, 7)
extremevals.sample(7)
                           avgtemp  ...   country
measuredate  stationid              ...       
2013-08-01	QAM00041170	35.30    ...	Qatar
2005-01-01	RSM00024966	-40.09   ...	Russia
1973-03-01	CA002401200	-40.26   ...	Canada
2007-06-01	KU000405820	37.35    ...	Kuwait
1987-07-01	SUM00062700	35.50    ...	Sudan
1998-02-01	RSM00025325	-35.71   ...	Russia
1968-12-01	RSM00024329	-43.20   ...	Russia
[7 rows x 7 columns]
extremevals.to_excel('views/tempext.xlsx')
extremevals.to_csv('views/tempext.csv')
  4. Save to Pickle and Feather files.

The index needs to be reset in order to save a Feather file:

landtemps.to_pickle('data/landtemps.pkl')
landtemps.reset_index(inplace=True)
landtemps.to_feather("data/landtemps.ftr")
  5. Load the Pickle and Feather files we just saved.

Note that our index was preserved when saving and loading the Pickle file:

landtemps = pd.read_pickle('data/landtemps.pkl')
landtemps.head(2).T
measuredate	2000-04-01	1940-05-01
stationid	USS0010K01S	CI000085406
avgtemp	5.27		18.04
latitude	39.90		-18.35
longitude	-110.75		-70.33
elevation	2,773.70	58.00
station	INDIAN_CANYON	ARICA
countryid	US		CI
country	United States	Chile
landtemps = pd.read_feather("data/landtemps.ftr")
landtemps.head(2).T
                               0                    1
measuredate	2000-04-01 00:00:00	1940-05-01 00:00:00
stationid	USS0010K01S		CI000085406
avgtemp	5.27			18.04
latitude	39.90			-18.35
longitude	-110.75			-70.33
elevation	2,773.70		58.00
station	INDIAN_CANYON		ARICA
countryid	US			CI
country	United States		Chile

The previous steps demonstrated how to serialize pandas DataFrames using two different binary formats, Pickle and Feather.

How it works...

Persisting pandas data is quite straightforward. DataFrames have the to_csv, to_excel, to_pickle, and to_feather methods. Pickling preserves our index; saving to Feather requires us to reset it first.
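
To make the index point concrete, here is a minimal sketch (using a small made-up DataFrame rather than the land temperatures data) showing that a Pickle round trip is lossless with respect to the index and the data types:

import pandas as pd

# a tiny DataFrame with a named index, standing in for landtemps
df = pd.DataFrame({'avgtemp': [5.27, 18.04]},
    index=pd.Index(['USS0010K01S', 'CI000085406'], name='stationid'))

df.to_pickle('demo.pkl')
reloaded = pd.read_pickle('demo.pkl')

# the index (including its name) and the dtypes survive the round trip
assert reloaded.index.equals(df.index)
assert reloaded.index.name == df.index.name
assert reloaded.dtypes.equals(df.dtypes)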

There’s more...

The advantage of storing data in CSV files is that saving it uses up very little additional memory. The disadvantage is that writing CSV files is quite slow, and we lose important metadata, such as data types. (read_csv can often figure out the data types when we reload the file, but not always.) Pickle files keep that metadata but can burden a system that is low on resources when serializing. Feather is easier on resources and can be easily loaded in R as well as Python, but we have to sacrifice our index in order to serialize. Also, the authors of Feather make no promises regarding long-term support.
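
To see the metadata loss in action, here is a minimal sketch (with made-up column values and hypothetical file names) comparing a CSV round trip to a Pickle round trip:

import pandas as pd

df = pd.DataFrame({'measuredate': pd.to_datetime(['2000-04-01']),
    'countryid': pd.Categorical(['US'])})
print(df.dtypes)
# measuredate    datetime64[ns]
# countryid            category

# CSV round trip: the date comes back as a plain string unless we
# re-parse it, and the categorical dtype is lost entirely
df.to_csv('demo.csv', index=False)
print(pd.read_csv('demo.csv').dtypes)
# measuredate    object
# countryid      object

# Pickle round trip: both dtypes survive intact
df.to_pickle('demo.pkl')
print(pd.read_pickle('demo.pkl').dtypes)
# measuredate    datetime64[ns]
# countryid            category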

You may have noticed that I do not make a global recommendation about what to use for data serialization, other than to limit your persistence of full datasets to project milestones. This is definitely one of those "right tool for the right job" kinds of situations. I use CSV or Excel files when I want to share a segment of a file with colleagues for discussion. I use Feather for ongoing Python projects, particularly when I am using a machine with sub-par RAM and an outdated chip, and when I am also working in R. When I am wrapping up a project, I pickle the DataFrames.
