Persisting tabular data
We persist data, copy it from memory to local or remote storage, for several reasons: to be able to access the data without having to repeat the steps we used to generate it, to share the data with others, or to make it available for use with different software. In this recipe, we save data that we have loaded into a pandas DataFrame as different file types (CSV, Excel, Pickle, and Feather).
Another important, but sometimes overlooked, reason to persist data is to preserve some segment of our data that needs to be examined more closely; perhaps it needs to be scrutinized by others before our analysis can be completed. For analysts who work with operational data in medium- to large-sized organizations, this process is part of the daily data-cleaning workflow.
In addition to these reasons for persisting data, our decisions about when and how to serialize data are shaped by several other factors: where we are in terms of our data analysis projects, the hardware and software resources of the machine(s) saving and reloading the data, and the size of our dataset. Analysts end up having to be much more intentional when saving data than they are when pressing Ctrl + S in their word-processing application.
Once we persist data, it is stored separately from the logic that we used to create it. I find this separation to be one of the most important threats to the integrity of our analysis. Often, we end up loading data that we saved some time in the past (a week ago? a month ago? a year ago?) and forget how a variable was defined and how it relates to other variables. If we are in the middle of a data-cleaning task, it is best not to persist our data, so long as our workstation and network can easily handle the burden of regenerating it. It is a good idea to persist data only once we have reached milestones in our work.
Beyond the question of when to persist data, there is the question of how. If we are persisting it for our own reuse with the same software, it is best to save it in a binary format native to that software. That is pretty straightforward for tools such as SPSS, SAS, Stata, and R, but not so much for pandas. But that is good news in a way. We have lots of choices, from CSV and Excel to Pickle and Feather. We save our data in all of these formats in this recipe.
Note
Pickle and Feather are binary file formats that can be used to store pandas DataFrames.
Getting ready
You will need to install Feather if you do not have it on your system. You can do that by entering pip install pyarrow in a Terminal window or PowerShell (in Windows). If you do not already have a subfolder named views in your chapter 1 folder, you will need to create it in order to run the code for this recipe.
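If you would rather create the folder from code than by hand, a minimal sketch (assuming the script runs from your chapter 1 folder) is:

import os

# create the views subfolder; do nothing if it already exists
os.makedirs('views', exist_ok=True)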
Data note
This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly. I used the data from version 4. This recipe works with a 100,000-row sample of the full dataset, which is also available in the repository.
How to do it…
We will load a CSV file into pandas and then save it as a Pickle and a Feather file. We will also save subsets of the data to the CSV and Excel formats:
- Import pandas and pyarrow. pyarrow needs to be imported in order to save pandas to Feather:
import pandas as pd
import pyarrow
- Load the land temperatures CSV file into pandas, drop rows with missing data, and set an index:
landtemps = pd.read_csv('data/landtempssample.csv',
    names=['stationid','year','month','avgtemp',
      'latitude','longitude','elevation',
      'station','countryid','country'],
    skiprows=1,
    parse_dates=[['month','year']],
    low_memory=False)
landtemps.rename(columns={'month_year':'measuredate'},
    inplace=True)
landtemps.dropna(subset=['avgtemp'], inplace=True)
landtemps.dtypes
measuredate    datetime64[ns]
stationid              object
avgtemp               float64
latitude              float64
longitude             float64
elevation             float64
station                object
countryid              object
country                object
dtype: object
landtemps.set_index(['measuredate','stationid'], inplace=True)
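A side note on the load step: passing the nested list [['month','year']] to parse_dates tells read_csv to combine those two columns into a single month_year datetime column, which we then rename to measuredate. Newer pandas versions deprecate this nested-list form; a sketch of an alternative (my workaround, not the book's code) that builds the datetime after loading:

landtemps = pd.read_csv('data/landtempssample.csv',
    names=['stationid','year','month','avgtemp',
      'latitude','longitude','elevation',
      'station','countryid','country'],
    skiprows=1, low_memory=False)

# assemble the datetime from the year and month columns; day 1 is assumed
landtemps['measuredate'] = pd.to_datetime(
    landtemps[['year','month']].assign(day=1))
landtemps = landtemps.drop(columns=['year','month'])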
- Write extreme values for temperature to CSV and Excel files. Use the quantile method to select outlier rows, which are those at the 1-in-1,000 level at each end of the distribution:
extremevals = landtemps[(landtemps.avgtemp <
    landtemps.avgtemp.quantile(.001)) |
    (landtemps.avgtemp >
    landtemps.avgtemp.quantile(.999))]
extremevals.shape
(171, 7)
extremevals.sample(7)
avgtemp ... country
measuredate stationid ...
2013-08-01 QAM00041170 35.30 ... Qatar
2005-01-01 RSM00024966 -40.09 ... Russia
1973-03-01 CA002401200 -40.26 ... Canada
2007-06-01 KU000405820 37.35 ... Kuwait
1987-07-01 SUM00062700 35.50 ... Sudan
1998-02-01 RSM00025325 -35.71 ... Russia
1968-12-01 RSM00024329 -43.20 ... Russia
[7 rows x 7 columns]
extremevals.to_excel('views/tempext.xlsx')
extremevals.to_csv('views/tempext.csv')
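If you would rather put the cold and hot extremes on separate worksheets, one option is pandas' ExcelWriter. A sketch, not part of the recipe; the filename is my own, and it assumes an Excel engine such as openpyxl is installed:

with pd.ExcelWriter('views/tempext_tails.xlsx') as writer:
    # reuse the same quantile cut-offs to split the two tails
    extremevals[extremevals.avgtemp <
      landtemps.avgtemp.quantile(.001)].to_excel(writer, sheet_name='cold')
    extremevals[extremevals.avgtemp >
      landtemps.avgtemp.quantile(.999)].to_excel(writer, sheet_name='hot')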
- Save to Pickle and Feather files.
The index needs to be reset in order to save a Feather file:
landtemps.to_pickle('data/landtemps.pkl')
landtemps.reset_index(inplace=True)
landtemps.to_feather("data/landtemps.ftr")
- Load the Pickle and Feather files we just saved.
Note that our index was preserved when saving and loading the Pickle file:
landtemps = pd.read_pickle('data/landtemps.pkl')
landtemps.head(2).T
measuredate 2000-04-01 1940-05-01
stationid USS0010K01S CI000085406
avgtemp 5.27 18.04
latitude 39.90 -18.35
longitude -110.75 -70.33
elevation 2,773.70 58.00
station INDIAN_CANYON ARICA
countryid US CI
country United States Chile
landtemps = pd.read_feather("data/landtemps.ftr")
landtemps.head(2).T
0 1
measuredate 2000-04-01 00:00:00 1940-05-01 00:00:00
stationid USS0010K01S CI000085406
avgtemp 5.27 18.04
latitude 39.90 -18.35
longitude -110.75 -70.33
elevation 2,773.70 58.00
station INDIAN_CANYON ARICA
countryid US CI
country United States Chile
The previous steps demonstrated how to serialize pandas DataFrames using two different formats, Pickle and Feather.
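A quick sanity check, which is my addition rather than part of the recipe: the two round trips should return the same data once we rebuild the index that the Feather file could not store:

from pandas.testing import assert_frame_equal

pickled = pd.read_pickle('data/landtemps.pkl')
feathered = pd.read_feather('data/landtemps.ftr')

# the pickle kept the MultiIndex; restore it on the Feather copy before comparing
assert_frame_equal(pickled,
    feathered.set_index(['measuredate','stationid']))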
How it works...
Persisting pandas data is quite straightforward. DataFrames have to_csv, to_excel, to_pickle, and to_feather methods. Pickling preserves our index.
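Each write method has a matching read counterpart, which we used in step 5 for Pickle and Feather. Assuming the files written earlier in this recipe (and, for Excel, an engine such as openpyxl):

extremevals = pd.read_csv('views/tempext.csv')    # plain text; dtypes are re-inferred
extremevals = pd.read_excel('views/tempext.xlsx')
landtemps = pd.read_pickle('data/landtemps.pkl')  # index and dtypes preserved
landtemps = pd.read_feather('data/landtemps.ftr')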
There’s more...
The advantage of storing data in CSV files is that saving it uses up very little additional memory. The disadvantage is that writing CSV files is quite slow, and we lose important metadata, such as data types. (read_csv can often figure out the data types when we reload the file, but not always.) Pickle files keep that metadata, but serializing can burden a system that is low on resources. Feather is easier on resources, and can be easily loaded in R as well as Python, but we have to sacrifice our index in order to serialize. Also, the authors of Feather make no promises regarding long-term support.
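A small sketch that illustrates the metadata point (the demo file paths are my own): a datetime column survives a Pickle round trip untouched, but comes back from CSV as object unless we re-parse it:

demo = pd.DataFrame({'measured': pd.to_datetime(['2000-04-01']),
    'avgtemp': [5.27]})

demo.to_csv('views/demo.csv', index=False)
pd.read_csv('views/demo.csv').dtypes
# measured is now object; re-parsing recovers it
pd.read_csv('views/demo.csv', parse_dates=['measured']).dtypes

demo.to_pickle('views/demo.pkl')
pd.read_pickle('views/demo.pkl').dtypes
# measured is still datetime64[ns], with no extra work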
You may have noticed that I do not make a global recommendation about what to use for data serialization – other than to limit your persistence of full datasets to project milestones. This is definitely one of those “right tools for the right job” kind of situations. I use CSV or Excel files when I want to share a segment of a file with colleagues for discussion. I use Feather for ongoing Python projects, particularly when I am using a machine with sub-par RAM and an outdated chip and also using R. When I am wrapping up a project, I pickle the DataFrames.