Importing SPSS, Stata, and SAS data
We will use pyreadstat to read data from three popular statistical packages into pandas. The key advantage of pyreadstat is that it allows data analysts to import data from these packages without losing metadata, such as variable and value labels.
The SPSS, Stata, and SAS data files we receive often come to us with the data issues of CSV and Excel files and SQL databases already resolved. We do not typically have the invalid column names, changes in data types, and unclear missing values that we can get with CSV or Excel files, nor do we usually get the detachment of data from business logic, such as the meaning of data codes, that we often get with SQL data. When someone or some organization shares a data file from one of these packages with us, they have often added variable labels and value labels for categorical data. For example, a hypothetical data column called presentsat would have the variable label overall satisfaction with presentation and value labels 1 through 5, with 1 being not at all satisfied and 5 being highly satisfied.
The challenge is retaining that metadata when importing data from those systems into pandas. There is no precise equivalent to variable and value labels in pandas, and built-in tools for importing SPSS, Stata, and SAS data lose the metadata. In this recipe, we will use pyreadstat to load variable and value label information and use a couple of techniques for representing that information in pandas.
Getting ready
This recipe assumes you have installed the pyreadstat package. If it is not installed, you can install it with pip. From the Terminal, or PowerShell (in Windows), enter pip install pyreadstat. You will need the SPSS, Stata, and SAS data files for this recipe to run the code.
We will work with data from the United States National Longitudinal Surveys (NLS) of Youth.
Data note
The NLS of Youth is conducted by the United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997. Each survey respondent was high school age when they first completed the survey, having been born between 1980 and 1985. There were annual follow-up surveys through 2023. For this recipe, I pulled 42 variables on grades, employment, income, and attitudes toward government from the hundreds of data items on the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository.
The original NLS data can be downloaded from https://www.nlsinfo.org/investigator/pages/search, along with code for creating SPSS, Stata, or SAS files from the ASCII data files included in the download.
How to do it...
We will import data from SPSS, Stata, and SAS, retaining metadata such as value labels:
1. Import pandas, numpy, and pyreadstat.
This step assumes that you have installed pyreadstat:
import pandas as pd
import numpy as np
import pyreadstat
2. Retrieve the SPSS data.
Pass a path and filename to the read_sav method of pyreadstat. Display the first few rows and a frequency distribution. Note that the column names and value labels are non-descriptive, and that read_sav returns both a pandas DataFrame and a meta object:
nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav')
nls97spss.dtypes
R0000100 float64
R0536300 float64
R0536401 float64
...
U2962900 float64
U2963000 float64
Z9063900 float64
dtype: object
nls97spss.head()
R0000100 R0536300 ... U2963000 Z9063900
0 1 2 ... nan 52
1 2 1 ... 6 0
2 3 2 ... 6 0
3 4 2 ... 6 4
4 5 1 ... 5 12
[5 rows x 42 columns]
nls97spss['R0536300'].value_counts(normalize=True)
1.00 0.51
2.00 0.49
Name: R0536300, dtype: float64
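Before improving anything, it can help to peek at what the meta object holds. A quick look (a sketch; column_names and column_labels are attributes of the metadata object pyreadstat returns):
# the meta object pairs original column names with descriptive labels
metaspss.column_names[:2]
['R0000100', 'R0536300']
metaspss.column_labels[:2]
['PUBID - YTH ID CODE 1997', 'KEY!SEX (SYMBOL) 1997']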
3. Grab the metadata to improve column labels and value labels.
The metaspss object created when we called read_sav has the column labels and the value labels from the SPSS file. Use the variable_value_labels dictionary to map values to value labels for one column (R0536300). (This does not change the data. It only improves our display when we run value_counts.) Use the set_value_labels method to actually apply the value labels to the DataFrame:
metaspss.variable_value_labels['R0536300']
{0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'}
nls97spss['R0536300'].\
... map(metaspss.variable_value_labels['R0536300']).\
... value_counts(normalize=True)
Male 0.51
Female 0.49
Name: R0536300, dtype: float64
nls97spss = pyreadstat.set_value_labels(nls97spss, metaspss, formats_as_category=True)
4. Use column labels in the metadata to rename the columns.
To use the column labels from metaspss in our DataFrame, we can simply assign the column labels in metaspss to our DataFrame’s column names. Clean up the column names a bit by changing them to lowercase, changing spaces to underscores, and removing all remaining non-alphanumeric characters:
nls97spss.columns = metaspss.column_labels
nls97spss['KEY!SEX (SYMBOL) 1997'].value_counts(normalize=True)
Male 0.51
Female 0.49
Name: KEY!SEX (SYMBOL) 1997, dtype: float64
nls97spss.dtypes
PUBID - YTH ID CODE 1997 float64
KEY!SEX (SYMBOL) 1997 category
KEY!BDATE M/Y (SYMBOL) 1997 float64
KEY!BDATE M/Y (SYMBOL) 1997 float64
CV_SAMPLE_TYPE 1997 category
KEY!RACE_ETHNICITY (SYMBOL) 1997 category
"... abbreviated to save space"
HRS/WK R WATCHES TELEVISION 2017 category
HRS/NIGHT R SLEEPS 2017 float64
CVC_WKSWK_YR_ALL L99 float64
dtype: object
nls97spss.columns = nls97spss.columns.\
... str.lower().\
... str.replace(' ','_').\
... str.replace('[^a-z0-9_]', '', regex=True)
nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
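Note that the two birth-date columns share the same label, so this label-based rename leaves duplicate column names in the DataFrame. A quick check for that (a sketch):
# flag any column names duplicated by the label-based rename
nls97spss.columns[nls97spss.columns.duplicated()]
Index(['keybdate_my_symbol_1997'], dtype='object')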
5. Simplify the process by applying the value labels from the beginning.
The value labels can actually be applied in the initial call to read_sav by setting apply_value_formats to True. This eliminates the need to call the set_value_labels function later:
nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav', apply_value_formats=True, formats_as_category=True)
nls97spss.columns = metaspss.column_labels
nls97spss.columns = nls97spss.columns.\
... str.lower().\
... str.replace(' ','_').\
... str.replace('[^a-z0-9_]', '', regex=True)
6. Show the columns and a few rows:
nls97spss.dtypes
pubid__yth_id_code_1997     float64
keysex_symbol_1997         category
keybdate_my_symbol_1997     float64
keybdate_my_symbol_1997     float64
hrsnight_r_sleeps_2017      float64
cvc_wkswk_yr_all_l99        float64
dtype: object
nls97spss.head()
pubid__yth_id_code_1997 keysex_symbol_1997 ... \
0 1 Female ...
1 2 Male ...
2 3 Female ...
3 4 Female ...
4 5 Male ...
hrsnight_r_sleeps_2017 cvc_wkswk_yr_all_l99
0 nan 52
1 6 0
2 6 0
3 6 4
4 5 12
[5 rows x 42 columns]
7. Run frequencies on one of the columns, and set the index:
nls97spss.govt_responsibility__provide_jobs_2006.\
... value_counts(sort=False)
Definitely should be        454
Definitely should not be    300
Probably should be          617
Probably should not be      462
Name: govt_responsibility__provide_jobs_2006, dtype: int64
nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
That demonstrated how to convert data from SPSS. Let’s try that with Stata data.
8. Import the Stata data, apply value labels, and improve the column headings.
Use the same methods for the Stata data that we used for the SPSS data:
nls97stata, metastata = pyreadstat.read_dta('data/nls97.dta', apply_value_formats=True, formats_as_category=True)
nls97stata.columns = metastata.column_labels
nls97stata.columns = nls97stata.columns.\
... str.lower().\
... str.replace(' ','_').\
... str.replace('[^a-z0-9_]', '', regex=True)
nls97stata.dtypes
pubid__yth_id_code_1997 float64
keysex_symbol_1997 category
keybdate_my_symbol_1997 float64
keybdate_my_symbol_1997 float64
hrsnight_r_sleeps_2017 float64
cvc_wkswk_yr_all_l99 float64
dtype: object
9. View a few rows of the data and run frequencies:
nls97stata.head()
pubid__yth_id_code_1997 keysex_symbol_1997 ... \
0 1 Female ...
1 2 Male ...
2 3 Female ...
3 4 Female ...
4 5 Male ...
hrsnight_r_sleeps_2017 cvc_wkswk_yr_all_l99
0 -5 52
1 6 0
2 6 0
3 6 4
4 5 12
[5 rows x 42 columns]
nls97stata.govt_responsibility__provide_jobs_2006.\
... value_counts(sort=False)
-5.0                        1425
-4.0                        5665
-2.0                          56
-1.0                           5
Definitely should be         454
Definitely should not be     300
Probably should be           617
Probably should not be       462
Name: govt_responsibility__provide_jobs_2006, dtype: int64
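Before replacing those nonresponse codes, we can gauge how widespread they are. A quick check (a sketch; it only examines the columns that are still numeric, since labeled columns are now category):
# count values in the NLS nonresponse range (-9 to -1), numeric columns only
nls97stata.select_dtypes('number').isin(range(-9, 0)).sum().head()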
10. Fix the logical missing values that show up with the Stata data and set an index.
We can use the replace method to set any value that is between -9 and -1 in any column to missing:
nls97stata.min(numeric_only=True)
pubid__yth_id_code_1997         1
keybdate_my_symbol_1997         1
keybdate_my_symbol_1997     1,980
trans_sat_verbal_hstr          -4
trans_sat_math_hstr            -4
trans_crd_gpa_overall_hstr     -9
trans_crd_gpa_eng_hstr         -9
trans_crd_gpa_math_hstr        -9
trans_crd_gpa_lp_sci_hstr      -9
cv_ba_credits_l1_2011          -5
cv_bio_child_hh_2017           -5
cv_bio_child_nr_2017           -5
hrsnight_r_sleeps_2017         -5
cvc_wkswk_yr_all_l99           -4
dtype: float64
nls97stata.replace(list(range(-9,0)), np.nan, inplace=True)
nls97stata.min(numeric_only=True)
pubid__yth_id_code_1997         1
keybdate_my_symbol_1997         1
keybdate_my_symbol_1997     1,980
trans_sat_verbal_hstr          14
trans_sat_math_hstr             7
trans_crd_gpa_overall_hstr     10
trans_crd_gpa_eng_hstr          0
trans_crd_gpa_math_hstr         0
trans_crd_gpa_lp_sci_hstr       0
cv_ba_credits_l1_2011           0
cv_bio_child_hh_2017            0
cv_bio_child_nr_2017            0
hrsnight_r_sleeps_2017          0
cvc_wkswk_yr_all_l99            0
dtype: float64
nls97stata.set_index('pubid__yth_id_code_1997', inplace=True)
The process is fairly similar when working with SAS data files, as the next step illustrates.
11. Retrieve the SAS data, using the SAS catalog file for value labels.
The value labels for SAS are stored in a catalog file. Setting the catalog file path and filename retrieves the value labels and applies them:
nls97sas, metasas = pyreadstat.read_sas7bdat('data/nls97.sas7bdat', catalog_file='data/nlsformats3.sas7bcat', formats_as_category=True)
nls97sas.columns = metasas.column_labels
nls97sas.columns = nls97sas.columns.\
... str.lower().\
... str.replace(' ','_').\
... str.replace('[^a-z0-9_]', '', regex=True)
nls97sas.head()
pubid__yth_id_code_1997 keysex_symbol_1997 ... \
0 1 Female ...
1 2 Male ...
2 3 Female ...
3 4 Female ...
4 5 Male ...
hrsnight_r_sleeps_2017 cvc_wkswk_yr_all_l99
0 nan 52
1 6 0
2 6 0
3 6 4
4 5 12
[5 rows x 42 columns]
nls97sas.keysex_symbol_1997.value_counts()
Male 4599
Female 4385
Name: keysex_symbol_1997, dtype: int64
nls97sas.set_index('pubid__yth_id_code_1997', inplace=True)
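If you receive the catalog file separately from the data, or want to look at the unlabeled data first, pyreadstat can also read the catalog on its own and apply it afterward, using read_sas7bcat and set_catalog_to_sas. A rough sketch, with the same file paths as above (treat the exact arguments as an assumption to verify against the pyreadstat documentation):
# read the data and the format catalog separately
nls97sas2, metasas2 = pyreadstat.read_sas7bdat('data/nls97.sas7bdat')
_, catalogmeta = pyreadstat.read_sas7bcat('data/nlsformats3.sas7bcat')
# apply the catalog's value labels to the already-loaded DataFrame
nls97sas2, metasas2 = pyreadstat.set_catalog_to_sas(nls97sas2, metasas2, catalogmeta, formats_as_category=True)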
This demonstrates how to import SPSS, Stata, and SAS data without losing important metadata.
How it works...
The read_sav, read_dta, and read_sas7bdat methods of pyreadstat, for SPSS, Stata, and SAS data files, respectively, work in a similar manner. Value labels can be applied when reading in the data by setting apply_value_formats to True for SPSS and Stata files (Steps 5 and 8), or by providing a catalog file path and filename for SAS (Step 11). We can set formats_as_category to True to change the data type to category for those columns where the data values will change. The meta object has the column names and the column labels from the statistical package, so metadata column labels can be assigned to pandas DataFrame column names at any point (nls97spss.columns = metaspss.column_labels). We can even revert to the original column headings after assigning meta column labels to them by setting pandas column names to the metadata column names (nls97spss.columns = metaspss.column_names).
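Put together, switching between the two is just a matter of reassigning columns (a sketch; do this before setting the index, since the metadata lists all 42 columns):
# switch to the descriptive labels, then back to the original SPSS names
nls97spss.columns = metaspss.column_labels
nls97spss.columns = metaspss.column_names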
In Step 3, we looked at some of the SPSS data before applying value labels. We looked at the dictionary for one variable (metaspss.variable_value_labels['R0536300']), but we could have viewed it for all variables (metaspss.variable_value_labels). When we are satisfied that the labels make sense, we can set them by calling the set_value_labels function. This is a good approach when you do not know the data well and want to inspect the labels before applying them.
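A quick way to do that inspection before committing to the labels (a sketch using the metaspss object from Step 2):
# look over the value-label dictionaries a few variables at a time
for col, labels in list(metaspss.variable_value_labels.items())[:5]:
    print(col, labels)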
The column labels from the meta object are often a better choice than the original column headings. Column headings can be quite cryptic, particularly when the SPSS, Stata, or SAS file is based on a large survey, as in this example. However, the labels are not usually ideal for column headings either. They sometimes have spaces, capitalization that is not helpful, and non-alphanumeric characters. We chain some string operations to switch to lowercase, replace spaces with underscores, and remove non-alphanumeric characters.
Handling missing values is not always straightforward with these data files, since there are often many reasons why data is missing. If the file is from a survey, the missing value may be the result of a survey skip pattern, a respondent failing to respond, an invalid response, and so on. The NLS has nine possible values for missing, from -1 to -9. The SPSS import automatically set those values to NaN, while the Stata import retained the original values. (We could have gotten the SPSS import to retain those values by setting user_missing to True.) For the Stata data, we need to tell it to replace all values from -1 to -9 with NaN. We do this by using the DataFrame’s replace function and passing it a list of integers from -9 to -1 (list(range(-9,0))).
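If we had wanted the SPSS import to keep those codes the way the Stata import did, a minimal sketch:
# retain user-defined missing values rather than converting them to NaN
nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav', user_missing=True)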
There’s more…
You may have noticed similarities between this recipe and the previous one in terms of how value labels are set. The set_value_labels function is like the DataFrame replace operation we used to set value labels in that recipe. We passed a dictionary to replace that mapped columns to value labels. The set_value_labels function in this recipe essentially does the same thing, using the variable_value_labels property of the meta object as the dictionary.
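To make the analogy concrete, here is a rough hand-rolled equivalent of set_value_labels. This is a sketch only: it assumes the columns still carry their original SPSS names, and it glosses over edge cases such as partially labeled columns:
# apply each variable's value-label dictionary with replace, as in the
# previous recipe, then store the result as a category column
labeled = nls97spss.copy()
for col, labels in metaspss.variable_value_labels.items():
    if col in labeled.columns:
        labeled[col] = labeled[col].replace(labels).astype('category')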
Data from statistical packages is often not as well structured as SQL databases tend to be in one significant way. Since they are designed to facilitate analysis, they often violate database normalization rules. There is often an implied relational structure that might have to be unflattened at some point. For example, the data may combine individual and event-level data – a person and hospital visits, a brown bear and the date it emerged from hibernation. Often, this data will need to be reshaped for some aspects of the analysis.
See also
The pyreadstat package is nicely documented at https://github.com/Roche/pyreadstat. The package has many useful options for selecting columns and handling missing data that space did not permit me to demonstrate in this recipe; one example appears in the sketch below. In Chapter 11, Tidying and Reshaping Data, we will examine how to normalize data that may have been flattened for analytical purposes.
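One of those options is worth a quick illustration: each of the read functions accepts a usecols argument, so you can load only the variables you need (a sketch using the SPSS file from this recipe):
# read just the ID and sex columns from the SPSS file
nls97subset, metasubset = pyreadstat.read_sav('data/nls97.sav', usecols=['R0000100', 'R0536300'])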