You're reading from Python Data Cleaning Cookbook Prepare your data for analysis with pandas, NumPy, Matplotlib, scikit-learn, and OpenAI

Product type Paperback

Published in May 2024

Publisher Packt

ISBN-13 9781803239873

Length 486 pages

Edition 2nd Edition

Languages

Python

Tools

Matplotlib

Concepts

Data Analysis

Author (1):

Michael Walker

View More author details

Table of Contents (14) Chapters

Preface

1. Anticipating Data Cleaning Issues When Importing Tabular Data with pandas

2. Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data FREE CHAPTER

3. Taking the Measure of Your Data

4. Identifying Outliers in Subsets of Data

5. Using Visualizations for the Identification of Unexpected Values

6. Cleaning and Exploring Data with Series Operations

7. Identifying and Fixing Missing Values

8. Encoding, Transforming, and Scaling Features

9. Fixing Messy Data When Aggregating

10. Addressing Data Issues When Combining DataFrames

11. Tidying and Reshaping Data

12. Automate Data Cleaning with User-Defined Functions, Classes, and Pipelines

13. Index

Importing SPSS, Stata, and SAS data

We will use pyreadstat to read data from three popular statistical packages into pandas. The key advantage of pyreadstat is that it allows data analysts to import data from these packages without losing metadata, such as variable and value labels.

The SPSS, Stata, and SAS data files we receive often come to us with the data issues of CSV and Excel files and SQL databases having been resolved. We do not typically have the invalid column names, changes in data types, and unclear missing values that we can get with CSV or Excel files, nor do we usually get the detachment of data from business logic, such as the meaning of data codes, that we often get with SQL data. When someone or some organization shares a data file from one of these packages with us, they have often added variable labels and value labels for categorical data. For example, a hypothetical data column called presentsat has the overall satisfaction with presentation variable label and 1–5 value labels, with 1 being not at all satisfied and 5 being highly satisfied.

The challenge is retaining that metadata when importing data from those systems into pandas. There is no precise equivalent to variable and value labels in pandas, and built-in tools for importing SAS, Stata, and SAS data lose the metadata. In this recipe, we will use pyreadstat to load variable and value label information and use a couple of techniques to represent that information in pandas.

Getting ready

This recipe assumes you have installed the pyreadstat package. If it is not installed, you can install it with pip. From the Terminal, or Powershell (in Windows), enter pip install pyreadstat. You will need the SPSS, Stata, and SAS data files for this recipe to run the code.

We will work with data from the United States National Longitudinal Surveys (NLS) of Youth.

Data note

The NLS of Youth is conducted by the United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997. Each survey respondent was high school age when they first completed the survey, having been born between 1980 and 1985. There were annual follow-up surveys each year through 2023. For this recipe, I pulled 42 variables on grades, employment, income, and attitudes toward government, from the hundreds of data items on the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository.

The original NLS data can be downloaded from https://www.nlsinfo.org/investigator/pages/search, along with code for creating SPSS, Stata, or SAS files from the ASCII data files included in the download.

How to do it...

We will import data from SPSS, Stata, and SAS, retaining metadata such as value labels:

Import pandas, numpy, and pyreadstat.

This step assumes that you have installed pyreadstat:

import pandas as pd
import numpy as np
import pyreadstat

Retrieve the SPSS data.

Pass a path and filename to the read_sav method of pyreadstat. Display the first few rows and a frequency distribution. Note that the column names and value labels are non-descriptive, and that read_sav returns both a pandas DataFrame and a meta object:

nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav')
nls97spss.dtypes

R0000100	float64
R0536300	float64
R0536401	float64
...
U2962900	float64
U2963000	float64
Z9063900	float64
dtype: object

nls97spss.head()

   R0000100  R0536300  ...  U2963000  Z9063900
0	1	2         ...  nan       52
1	2	1         ...  6         0
2	3	2         ...  6         0
3	4	2         ...  6         4
4	5	1         ...  5         12
[5 rows x 42 columns]

nls97spss['R0536300'].value_counts(normalize=True)

1.00	0.51
2.00	0.49
Name: R0536300, dtype: float64

Grab the metadata to improve column labels and value labels.

The metaspss object created when we called read_sav has the column labels and the value labels from the SPSS file. Use the variable_value_labels dictionary to map values to value labels for one column (R0536300). (This does not change the data. It only improves our display when we run value_counts.) Use the set_value_labels method to actually apply the value labels to the DataFrame:

metaspss.variable_value_labels['R0536300']

{0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'}

nls97spss['R0536300'].\
...   map(metaspss.variable_value_labels['R0536300']).\
...   value_counts(normalize=True)

Male		0.51
Female	0.49
Name: R0536300, dtype: float64

nls97spss = pyreadstat.set_value_labels(nls97spss, metaspss, formats_as_category=True)

Use column labels in the metadata to rename the columns.

To use the column labels from metaspss in our DataFrame, we can simply assign the column labels in metaspss to our DataFrame’s column names. Clean up the column names a bit by changing them to lowercase, changing spaces to underscores, and removing all remaining non-alphanumeric characters:

nls97spss.columns = metaspss.column_labels
nls97spss['KEY!SEX (SYMBOL) 1997'].value_counts(normalize=True)

Male		0.51
Female	0.49
Name: KEY!SEX (SYMBOL) 1997, dtype: float64

nls97spss.dtypes

PUBID - YTH ID CODE 		1997	float64
KEY!SEX (SYMBOL) 			1997	category
KEY!BDATE M/Y (SYMBOL)		1997	float64
KEY!BDATE M/Y (SYMBOL) 		1997	float64
CV_SAMPLE_TYPE 			1997	category
KEY!RACE_ETHNICITY (SYMBOL) 	1997	category
"... abbreviated to save space"
HRS/WK R WATCHES TELEVISION 	2017	category
HRS/NIGHT R SLEEPS 		 	2017	float64
CVC_WKSWK_YR_ALL L99			float64
dtype: object

nls97spss.columns = nls97spss.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '', regex=True)
nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)

Simplify the process by applying the value labels from the beginning.

The data values can actually be applied in the initial call to read_sav by setting apply_value_formats to True. This eliminates the need to call the set_value_labels function later:

nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav', apply_value_formats=True, formats_as_category=True)
nls97spss.columns = metaspss.column_labels
nls97spss.columns = nls97spss.columns.\
...   str.lower().\
...   str.replace(' ','_').\
...   str.replace('[^a-z0-9_]', '', regex=True)

Show the columns and a few rows:

nls97spss.dtypes

pubid__yth_id_code_1997	float64
keysex_symbol_1997		category
keybdate_my_symbol_1997	float64
keybdate_my_symbol_1997	float64
hrsnight_r_sleeps_2017	float64
cvc_wkswk_yr_all_l99	float64
dtype: object

nls97spss.head()

   pubid__yth_id_code_1997 keysex_symbol_1997  ...  \
0	1	Female  ... 
1	2	Male  ... 
2	3	Female  ... 
3	4	Female  ... 
4	5	Male  ... 
   hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99
0	nan	52
1	6	0
2	6	0
3	6	4
4	5	12
[5 rows x 42 columns]

Run frequencies on one of the columns, and set the index:

nls97spss.govt_responsibility__provide_jobs_2006.\
...   value_counts(sort=False)

Definitely should be	454
Definitely should not be	300
Probably should be		617
Probably should not be	462
Name: govt_responsibility__provide_jobs_2006, dtype: int64

nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)

That demonstrated how to convert data from SPSS. Let’s try that with Stata data.
Import the Stata data, apply value labels, and improve the column headings.

Use the same methods for the Stata data that we used for the SPSS data:

nls97stata, metastata = pyreadstat.read_dta('data/nls97.dta', apply_value_formats=True, formats_as_category=True)
nls97stata.columns = metastata.column_labels
nls97stata.columns = nls97stata.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '', regex=True)
nls97stata.dtypes

pubid__yth_id_code_1997	float64
keysex_symbol_1997		category
keybdate_my_symbol_1997	float64
keybdate_my_symbol_1997	float64
hrsnight_r_sleeps_2017	float64
cvc_wkswk_yr_all_l99	float64
dtype: object

View a few rows of the data and run frequencies:

nls97stata.head()

   pubid__yth_id_code_1997    keysex_symbol_1997  ...  \
0                        1                Female  ... 
1                        2                  Male  ... 
2                        3                Female  ... 
3                        4                Female  ... 
4                        5                  Male  ... 
   hrsnight_r_sleeps_2017    cvc_wkswk_yr_all_l99
0                      -5                      52
1                       6                       0
2                       6                       0
3                       6                       4
4                       5                      12
[5 rows x 42 columns]

nls97stata.govt_responsibility__provide_jobs_2006.\
...   value_counts(sort=False)

-5.0	1425
-4.0	5665
-2.0	56
-1.0	5
Definitely should be	454
Definitely should not be	300
Probably should be		617
Probably should not be	462
Name: govt_responsibility__provide_jobs_2006, dtype: int64

Fix the logical missing values that show up with the Stata data and set an index. We can use the replace method to set any value that is between –9 and –1 in any column to missing:

nls97stata.min(numeric_only=True)

pubid__yth_id_code_1997          1
keybdate_my_symbol_1997          1
keybdate_my_symbol_1997      1,980
trans_sat_verbal_hstr           -4
trans_sat_math_hstr             -4
trans_crd_gpa_overall_hstr      -9
trans_crd_gpa_eng_hstr          -9
trans_crd_gpa_math_hstr         -9
trans_crd_gpa_lp_sci_hstr       -9
cv_ba_credits_l1_2011           -5
cv_bio_child_hh_2017            -5
cv_bio_child_nr_2017            -5
hrsnight_r_sleeps_2017          -5
cvc_wkswk_yr_all_l99            -4
dtype: float64

nls97stata.replace(list(range(-9,0)), np.nan, inplace=True)
nls97stata.min(numeric_only=True)

pubid__yth_id_code_1997          1
keybdate_my_symbol_1997          1
keybdate_my_symbol_1997      1,980
trans_sat_verbal_hstr           14
trans_sat_math_hstr              7
trans_crd_gpa_overall_hstr      10
trans_crd_gpa_eng_hstr           0
trans_crd_gpa_math_hstr          0
trans_crd_gpa_lp_sci_hstr        0
cv_ba_credits_l1_2011            0
cv_bio_child_hh_2017             0
cv_bio_child_nr_2017             0
hrsnight_r_sleeps_2017           0
cvc_wkswk_yr_all_l99             0
dtype: float64

nls97stata.set_index('pubid__yth_id_code_1997', inplace=True)

The process is fairly similar when working with SAS data files, as the next few steps illustrate.

Retrieve the SAS data, using the SAS catalog file for value labels:

The data values for SAS are stored in a catalog file. Setting the catalog file path and filename retrieves the value labels and applies them:

nls97sas, metasas = pyreadstat.read_sas7bdat('data/nls97.sas7bdat', catalog_file='data/nlsformats3.sas7bcat', formats_as_category=True)
nls97sas.columns = metasas.column_labels
nls97sas.columns = nls97sas.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '', regex=True)
nls97sas.head()

   pubid__yth_id_code_1997   keysex_symbol_1997    ...  \
0			     1		       Female    ... 
1			     2		         Male    ... 
2			     3		       Female    ... 
3			     4		       Female    ... 
4			     5		         Male    ... 
   hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99
0			  nan			   52
1			    6			    0
2			    6			    0
3			    6			    4
4			    5			   12
[5 rows x 42 columns]

nls97sas.keysex_symbol_1997.value_counts()

Male		4599
Female	4385
Name: keysex_symbol_1997, dtype: int64

nls97sas.set_index('pubid__yth_id_code_1997', inplace=True)

This demonstrates how to import SPSS, SAS, and Stata data without losing important metadata.

How it works...

The read_sav, read_dta, and read_sas7bdat methods of Pyreadstat, for SPSS, Stata, and SAS data files, respectively, work in a similar manner. Value labels can be applied when reading in the data by setting apply_value_formats to True for SPSS and Stata files (Steps 5 and 8), or by providing a catalog file path and filename for SAS (Step 12).

We can set formats_as_category to True to change the data type to category for those columns where the data values will change. The meta object has the column names and the column labels from the statistical package, so metadata column labels can be assigned to pandas DataFrame column names at any point (nls97spss.columns = metaspss.column_labels). We can even revert to the original column headings after assigning meta column labels to them by setting pandas column names to the metadata column names (nls97spss.columns = metaspss.column_names).

In Step 3, we looked at some of the SPSS data before applying value labels. We looked at the dictionary for one variable (metaspss.variable_value_labels['R0536300']), but we could have viewed it for all variables (metaspss.variable_value_labels). When we are satisfied that the labels make sense, we can set them by calling the set_value_labels function. This is a good approach when you do not know the data well and want to inspect the labels before applying them.

The column labels from the meta object are often a better choice than the original column headings. Column headings can be quite cryptic, particularly when the SPSS, Stata, or SAS file is based on a large survey, as in this example. However, the labels are not usually ideal for column headings either. They sometimes have spaces, capitalization that is not helpful, and non-alphanumeric characters. We chain some string operations to switch to lowercase, replace spaces with underscores, and remove non-alphanumeric characters.

Handling missing values is not always straightforward with these data files, since there are often many reasons why data is missing. If the file is from a survey, the missing value may be because of a survey skip pattern, or a respondent failed to respond, or the response was invalid, and so on. The NLS has nine possible values for missing, from –1 to –9. The SPSS import automatically set those values to NaN, while the Stata import retained the original values. (We could have gotten the SPSS import to retain those values by setting user_missing to True.) For the Stata data, we need to tell it to replace all values from –1 to –9 with NaN. We do this by using the DataFrame’s replace function and passing it a list of integers from –9 to –1 (list(range(-9,0))).

There’s more…

You may have noticed similarities between this recipe and the previous one in terms of how value labels are set. The set_value_labels function is like the DataFrame replace operation we used to set value labels in that recipe. We passed a dictionary to replace that mapped columns to value labels. The set_value_labels function in this recipe essentially does the same thing, using the variable_value_labels property of the meta object as the dictionary.

Data from statistical packages is often not as well structured as SQL databases tend to be in one significant way. Since they are designed to facilitate analysis, they often violate database normalization rules. There is often an implied relational structure that might have to be unflattened at some point. For example, the data may combine individual and event-level data – a person and hospital visits, a brown bear and the date it emerged from hibernation. Often, this data will need to be reshaped for some aspects of the analysis.