Importing SPSS, Stata, and SAS data

We will use pyreadstat to read data from three popular statistical packages into pandas. The key advantage of pyreadstat is that it allows data analysts to import data from these packages without losing metadata, such as variable and value labels.

The SPSS, Stata, and SAS data files we receive often arrive with the data issues of CSV and Excel files and SQL databases already resolved. We do not typically have the invalid column names, changes in data types, and unclear missing values that we can get with CSV or Excel files, nor do we usually get the detachment of data from business logic, such as the meaning of data codes, that we often get with SQL data. When a person or organization shares a data file from one of these packages with us, they have often added variable labels and value labels for categorical data. For example, a hypothetical data column called presentsat might have the variable label overall satisfaction with presentation and five value labels for the values 1 through 5, with 1 being not at all satisfied and 5 being highly satisfied.
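
To make this concrete, here is a minimal sketch of that kind of metadata as Python dictionaries. The presentsat column is hypothetical, and the labels for the middle values (2 through 4) are invented for illustration:

# hypothetical variable label and value labels for a presentsat column
variable_labels = {'presentsat': 'overall satisfaction with presentation'}
value_labels = {'presentsat':
    {1: 'not at all satisfied', 2: 'somewhat dissatisfied', 3: 'neutral',
     4: 'somewhat satisfied', 5: 'highly satisfied'}}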

The challenge is retaining that metadata when importing data from those systems into pandas. There is no precise equivalent to variable and value labels in pandas, and built-in tools for importing SPSS, Stata, and SAS data lose the metadata. In this recipe, we will use pyreadstat to load variable and value label information and use a couple of techniques to represent that information in pandas.

Getting ready

This recipe assumes you have installed the pyreadstat package. If it is not installed, you can install it with pip. From a terminal, or PowerShell on Windows, enter pip install pyreadstat. You will need the SPSS, Stata, and SAS data files to run the code in this recipe.

We will work with data from the United States National Longitudinal Surveys (NLS) of Youth.

Data note

The NLS of Youth is conducted by the United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997. Each survey respondent was of high school age when they first completed the survey, having been born between 1980 and 1985. There were annual follow-up surveys through 2023. For this recipe, I pulled 42 variables on grades, employment, income, and attitudes toward government from the hundreds of data items on the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository.

The original NLS data can be downloaded from https://www.nlsinfo.org/investigator/pages/search, along with code for creating SPSS, Stata, or SAS files from the ASCII data files included in the download.

How to do it...

We will import data from SPSS, Stata, and SAS, retaining metadata such as value labels:

  1. Import pandas, numpy, and pyreadstat.

This step assumes that you have installed pyreadstat:

import pandas as pd
import numpy as np
import pyreadstat
  2. Retrieve the SPSS data.

Pass a path and filename to the read_sav method of pyreadstat. Display the first few rows and a frequency distribution. Note that the column names and the data values are non-descriptive codes, and that read_sav returns both a pandas DataFrame and a meta object:

nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav')
nls97spss.dtypes
R0000100	float64
R0536300	float64
R0536401	float64
...
U2962900	float64
U2963000	float64
Z9063900	float64
dtype: object
nls97spss.head()
   R0000100  R0536300  ...  U2963000  Z9063900
0         1         2  ...       nan        52
1         2         1  ...         6         0
2         3         2  ...         6         0
3         4         2  ...         6         4
4         5         1  ...         5        12
[5 rows x 42 columns]
nls97spss['R0536300'].value_counts(normalize=True)
1.00	0.51
2.00	0.49
Name: R0536300, dtype: float64
  3. Grab the metadata to improve column labels and value labels.

The metaspss object created when we called read_sav has the column labels and the value labels from the SPSS file. Use the variable_value_labels dictionary to map values to value labels for one column (R0536300). (This does not change the data. It only improves our display when we run value_counts.) Use the set_value_labels method to actually apply the value labels to the DataFrame:

metaspss.variable_value_labels['R0536300']
{0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'}
nls97spss['R0536300'].\
...   map(metaspss.variable_value_labels['R0536300']).\
...   value_counts(normalize=True)
Male		0.51
Female	0.49
Name: R0536300, dtype: float64
nls97spss = pyreadstat.set_value_labels(nls97spss, metaspss, formats_as_category=True)
  4. Use column labels in the metadata to rename the columns.

To use the column labels from metaspss in our DataFrame, we can simply assign the column labels in metaspss to our DataFrame’s column names. Clean up the column names a bit by changing them to lowercase, changing spaces to underscores, and removing all remaining non-alphanumeric characters:

nls97spss.columns = metaspss.column_labels
nls97spss['KEY!SEX (SYMBOL) 1997'].value_counts(normalize=True)
Male		0.51
Female	0.49
Name: KEY!SEX (SYMBOL) 1997, dtype: float64
nls97spss.dtypes
PUBID - YTH ID CODE 1997             float64
KEY!SEX (SYMBOL) 1997               category
KEY!BDATE M/Y (SYMBOL) 1997          float64
KEY!BDATE M/Y (SYMBOL) 1997          float64
CV_SAMPLE_TYPE 1997                 category
KEY!RACE_ETHNICITY (SYMBOL) 1997    category
... (abbreviated to save space)
HRS/WK R WATCHES TELEVISION 2017    category
HRS/NIGHT R SLEEPS 2017              float64
CVC_WKSWK_YR_ALL L99                 float64
dtype: object
nls97spss.columns = nls97spss.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '', regex=True)
nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
  5. Simplify the process by applying the value labels from the beginning.

The value labels can actually be applied in the initial call to read_sav by setting apply_value_formats to True. This eliminates the need to call the set_value_labels function later:

nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav', apply_value_formats=True, formats_as_category=True)
nls97spss.columns = metaspss.column_labels
nls97spss.columns = nls97spss.columns.\
...   str.lower().\
...   str.replace(' ','_').\
...   str.replace('[^a-z0-9_]', '', regex=True)
  6. Show the columns and a few rows:
    nls97spss.dtypes
    
    pubid__yth_id_code_1997	float64
    keysex_symbol_1997		category
    keybdate_my_symbol_1997	float64
    keybdate_my_symbol_1997	float64
    hrsnight_r_sleeps_2017	float64
    cvc_wkswk_yr_all_l99	float64
    dtype: object
    
    nls97spss.head()
    
       pubid__yth_id_code_1997  keysex_symbol_1997  ...  \
    0                        1              Female  ... 
    1                        2                Male  ... 
    2                        3              Female  ... 
    3                        4              Female  ... 
    4                        5                Male  ... 
       hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99
    0                     nan                    52
    1                       6                     0
    2                       6                     0
    3                       6                     4
    4                       5                    12
    [5 rows x 42 columns]
    
  7. Run frequencies on one of the columns, and set the index:
    nls97spss.govt_responsibility__provide_jobs_2006.\
    ...   value_counts(sort=False)
    
    Definitely should be	454
    Definitely should not be	300
    Probably should be		617
    Probably should not be	462
    Name: govt_responsibility__provide_jobs_2006, dtype: int64
    
    nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
    
That demonstrated how to import data from SPSS. Let's try the same process with Stata data.

  8. Import the Stata data, apply value labels, and improve the column headings.

Use the same methods for the Stata data that we used for the SPSS data:

nls97stata, metastata = pyreadstat.read_dta('data/nls97.dta', apply_value_formats=True, formats_as_category=True)
nls97stata.columns = metastata.column_labels
nls97stata.columns = nls97stata.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '', regex=True)
nls97stata.dtypes
pubid__yth_id_code_1997	float64
keysex_symbol_1997		category
keybdate_my_symbol_1997	float64
keybdate_my_symbol_1997	float64
hrsnight_r_sleeps_2017	float64
cvc_wkswk_yr_all_l99	float64
dtype: object
  9. View a few rows of the data and run frequencies:
    nls97stata.head()
    
       pubid__yth_id_code_1997    keysex_symbol_1997  ...  \
    0                        1                Female  ... 
    1                        2                  Male  ... 
    2                        3                Female  ... 
    3                        4                Female  ... 
    4                        5                  Male  ... 
       hrsnight_r_sleeps_2017    cvc_wkswk_yr_all_l99
    0                      -5                      52
    1                       6                       0
    2                       6                       0
    3                       6                       4
    4                       5                      12
    [5 rows x 42 columns]
    
    nls97stata.govt_responsibility__provide_jobs_2006.\
    ...   value_counts(sort=False)
    
    -5.0	1425
    -4.0	5665
    -2.0	56
    -1.0	5
    Definitely should be	454
    Definitely should not be	300
    Probably should be		617
    Probably should not be	462
    Name: govt_responsibility__provide_jobs_2006, dtype: int64
    
  10. Fix the logical missing values that show up with the Stata data and set an index. We can use the replace method to set any value that is between –9 and –1 in any column to missing:
    nls97stata.min(numeric_only=True)
    
    pubid__yth_id_code_1997          1
    keybdate_my_symbol_1997          1
    keybdate_my_symbol_1997      1,980
    trans_sat_verbal_hstr           -4
    trans_sat_math_hstr             -4
    trans_crd_gpa_overall_hstr      -9
    trans_crd_gpa_eng_hstr          -9
    trans_crd_gpa_math_hstr         -9
    trans_crd_gpa_lp_sci_hstr       -9
    cv_ba_credits_l1_2011           -5
    cv_bio_child_hh_2017            -5
    cv_bio_child_nr_2017            -5
    hrsnight_r_sleeps_2017          -5
    cvc_wkswk_yr_all_l99            -4
    dtype: float64
    
    nls97stata.replace(list(range(-9,0)), np.nan, inplace=True)
    nls97stata.min(numeric_only=True)
    
    pubid__yth_id_code_1997          1
    keybdate_my_symbol_1997          1
    keybdate_my_symbol_1997      1,980
    trans_sat_verbal_hstr           14
    trans_sat_math_hstr              7
    trans_crd_gpa_overall_hstr      10
    trans_crd_gpa_eng_hstr           0
    trans_crd_gpa_math_hstr          0
    trans_crd_gpa_lp_sci_hstr        0
    cv_ba_credits_l1_2011            0
    cv_bio_child_hh_2017             0
    cv_bio_child_nr_2017             0
    hrsnight_r_sleeps_2017           0
    cvc_wkswk_yr_all_l99             0
    dtype: float64
    
    nls97stata.set_index('pubid__yth_id_code_1997', inplace=True)
    

The process is fairly similar when working with SAS data files, as the next few steps illustrate.

  11. Retrieve the SAS data, using the SAS catalog file for value labels:

The value labels for SAS are stored in a separate catalog file. Passing the catalog file path and filename to read_sas7bdat retrieves the value labels and applies them:

nls97sas, metasas = pyreadstat.read_sas7bdat('data/nls97.sas7bdat', catalog_file='data/nlsformats3.sas7bcat', formats_as_category=True)
nls97sas.columns = metasas.column_labels
nls97sas.columns = nls97sas.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '', regex=True)
nls97sas.head()
   pubid__yth_id_code_1997  keysex_symbol_1997  ...  \
0                        1              Female  ... 
1                        2                Male  ... 
2                        3              Female  ... 
3                        4              Female  ... 
4                        5                Male  ... 
   hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99
0                     nan                    52
1                       6                     0
2                       6                     0
3                       6                     4
4                       5                    12
[5 rows x 42 columns]
nls97sas.keysex_symbol_1997.value_counts()
Male		4599
Female	4385
Name: keysex_symbol_1997, dtype: int64
nls97sas.set_index('pubid__yth_id_code_1997', inplace=True)

This demonstrates how to import SPSS, SAS, and Stata data without losing important metadata.

How it works...

The read_sav, read_dta, and read_sas7bdat methods of pyreadstat, for SPSS, Stata, and SAS data files, respectively, work in a similar manner. Value labels can be applied when reading in the data by setting apply_value_formats to True for SPSS and Stata files (Steps 5 and 8), or by providing a catalog file path and filename for SAS (Step 11).

We can set formats_as_category to True to change the data type to category for those columns where the data values will change. The meta object has the column names and the column labels from the statistical package, so metadata column labels can be assigned to pandas DataFrame column names at any point (nls97spss.columns = metaspss.column_labels). We can even revert to the original column headings at any time by assigning the metadata column names back to the DataFrame (nls97spss.columns = metaspss.column_names).
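
As a quick sketch of that round trip, assuming the nls97spss DataFrame and metaspss object from the steps above:

# assign the descriptive metadata labels as column headings
nls97spss.columns = metaspss.column_labels
# revert to the original column names at any point
nls97spss.columns = metaspss.column_names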

In Step 3, we looked at some of the SPSS data before applying value labels. We looked at the dictionary for one variable (metaspss.variable_value_labels['R0536300']), but we could have viewed it for all variables (metaspss.variable_value_labels). When we are satisfied that the labels make sense, we can set them by calling the set_value_labels function. This is a good approach when you do not know the data well and want to inspect the labels before applying them.
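
For example, a minimal sketch for eyeballing a few of the label dictionaries before applying them, again assuming the metaspss object from Step 2:

# inspect the value labels for the first three variables that have them
for col, labels in list(metaspss.variable_value_labels.items())[:3]:
    print(col, labels)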

The column labels from the meta object are often a better choice than the original column headings. Column headings can be quite cryptic, particularly when the SPSS, Stata, or SAS file is based on a large survey, as in this example. However, the labels are not usually ideal for column headings either. They sometimes have spaces, capitalization that is not helpful, and non-alphanumeric characters. We chain some string operations to switch to lowercase, replace spaces with underscores, and remove non-alphanumeric characters.
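
Since we run the same cleanup chain for the SPSS, Stata, and SAS files, it could be wrapped in a small helper. This clean_column_labels function is just a convenience sketch, not part of pyreadstat:

def clean_column_labels(df, meta):
    # assign the descriptive metadata labels as column headings
    df.columns = meta.column_labels
    # lowercase, replace spaces with underscores, drop other characters
    df.columns = df.columns.str.lower().\
        str.replace(' ', '_').\
        str.replace('[^a-z0-9_]', '', regex=True)
    return df

We could then call clean_column_labels(nls97stata, metastata) rather than repeating the chain for each file.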

Handling missing values is not always straightforward with these data files, since there are often many reasons why data is missing. If the file is from a survey, the missing value may be because of a survey skip pattern, or a respondent failed to respond, or the response was invalid, and so on. The NLS has nine possible values for missing, from –1 to –9. The SPSS import automatically set those values to NaN, while the Stata import retained the original values. (We could have gotten the SPSS import to retain those values by setting user_missing to True.) For the Stata data, we need to tell it to replace all values from –1 to –9 with NaN. We do this by using the DataFrame’s replace function and passing it a list of integers from –9 to –1 (list(range(-9,0))).
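
A minimal sketch of that alternative for the SPSS file, assuming the same data/nls97.sav path; user_missing=True tells read_sav to keep the user-defined missing values instead of converting them to NaN:

# retain the missing-value codes, then convert them ourselves
nls97um, metaum = pyreadstat.read_sav('data/nls97.sav', user_missing=True)
nls97um.replace(list(range(-9, 0)), np.nan, inplace=True)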

There’s more…

You may have noticed similarities between this recipe and the previous one in terms of how value labels are set. The set_value_labels function is like the DataFrame replace operation we used to set value labels in that recipe. We passed a dictionary to replace that mapped columns to value labels. The set_value_labels function in this recipe essentially does the same thing, using the variable_value_labels property of the meta object as the dictionary.
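
In other words, something like the following sketch should label the values directly with replace, assuming the columns still have their original names (the keys of variable_value_labels):

# read without applying formats, then map codes to labels with replace
nls97raw, metaraw = pyreadstat.read_sav('data/nls97.sav')
# variable_value_labels is a {column: {value: label}} dictionary,
# the same nested shape that DataFrame.replace accepts
nls97labeled = nls97raw.replace(metaraw.variable_value_labels)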

In one significant way, data from statistical packages is often not as well structured as data from SQL databases tends to be. Since statistical packages are designed to facilitate analysis, their data files often violate database normalization rules. There is often an implied relational structure that might have to be unflattened at some point. For example, the data may combine individual- and event-level data – a person and their hospital visits, or a brown bear and the dates it emerged from hibernation. Often, this data will need to be reshaped for some aspects of the analysis.
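
As a minimal sketch, consider hypothetical flattened person-and-visit data, with the column names invented for illustration; pd.wide_to_long reshapes it to one row per visit:

import pandas as pd

# one row per person, one column per hospital visit (flattened)
flat = pd.DataFrame({'personid': [1, 2],
    'visitdate1': ['2023-01-05', '2023-02-11'],
    'visitdate2': ['2023-06-20', None]})
# reshape to one row per person per visit
visits = pd.wide_to_long(flat, stubnames='visitdate',
    i='personid', j='visitnum').dropna()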

See also

The pyreadstat package is nicely documented at https://github.com/Roche/pyreadstat. The package has many useful options for selecting columns and handling missing data that space did not permit me to demonstrate in this recipe. In Chapter 11, Tidying and Reshaping Data, we will examine how to normalize data that may have been flattened for analytical purposes.
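
For instance, pyreadstat's usecols parameter limits the read to named columns; here is a minimal sketch using two of the original NLS column names we saw earlier:

# read only the ID and sex columns from the SPSS file
nls97subset, metasubset = pyreadstat.read_sav('data/nls97.sav',
    usecols=['R0000100', 'R0536300'])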
