You're reading from Python Data Cleaning Cookbook Prepare your data for analysis with pandas, NumPy, Matplotlib, scikit-learn, and OpenAI

Product type Paperback

Published in May 2024

Publisher Packt

ISBN-13 9781803239873

Length 486 pages

Edition 2nd Edition

Languages

Python

Tools

Matplotlib

Concepts

Data Analysis

Author (1):

Michael Walker

View More author details

Table of Contents (14) Chapters

Preface

1. Anticipating Data Cleaning Issues When Importing Tabular Data with pandas FREE CHAPTER

2. Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data

3. Taking the Measure of Your Data

4. Identifying Outliers in Subsets of Data

5. Using Visualizations for the Identification of Unexpected Values

6. Cleaning and Exploring Data with Series Operations

7. Identifying and Fixing Missing Values

8. Encoding, Transforming, and Scaling Features

9. Fixing Messy Data When Aggregating

10. Addressing Data Issues When Combining DataFrames

11. Tidying and Reshaping Data

12. Automate Data Cleaning with User-Defined Functions, Classes, and Pipelines

13. Index

Importing R data

We will use pyreadr to read an R data file into pandas. Since pyreadr cannot capture the metadata, we will write code to reconstruct value labels (analogous to R factors) and column headings. This is similar to what we did in the Importing data from SQL databases recipe.

The R statistical package is, in many ways, similar to the combination of Python and pandas, at least in its scope. Both have strong tools across a range of data preparation and data analysis tasks. Some data scientists work with both R and Python, perhaps doing data manipulation in Python and statistical analysis in R, or vice versa, depending on their preferred packages. However, there is currently a scarcity of tools for reading data saved in R, as rds or rdata files, into Python. The analyst often saves the data as a CSV file first and then loads it into Python. We will use pyreadr, from the same author as pyreadstat, because it does not require an installation of R.

When we receive an R file, or work with one we have created ourselves, we can count on it being fairly well structured, at least compared to CSV or Excel files. Each column will have only one data type, column headings will have appropriate names for Python variables, and all rows will have the same structure. However, we may need to restore some of the coding logic, as we did when working with SQL data.

Getting ready

This recipe assumes you have installed the pyreadr package. If it is not installed, you can install it with pip. From the Terminal, or Powershell (in Windows), enter pip install pyreadr.

We will again work with the NLS in this recipe. You will need to download the rds file used in this recipe from the GitHub repository in order to run the code.

How to do it…

We will import data from R without losing important metadata:

Load pandas, numpy, pprint, and the pyreadr package:

import pandas as pd
import numpy as np
import pyreadr
import pprint

Get the R data.

Pass the path and filename to the read_r method to retrieve the R data, and load it into memory as a pandas DataFrame. read_r can return one or more objects. When reading an rds file (as opposed to an rdata file), it will return one object, having the key None. We indicate None to get the pandas DataFrame:

nls97r = pyreadr.read_r('data/nls97.rds')[None]
nls97r.dtypes

R0000100	int32
R0536300	int32
...
U2962800	int32
U2962900	int32
U2963000	int32
Z9063900	int32
dtype: object

nls97r.head(10)

     R0000100  R0536300  ...      U2963000    Z9063900
0	1	  2         ...      -5          52
1	2	  1         ...       6          0
2	3	  2         ...       6          0
3	4	  2         ...       6          4
4	5	  1         ...       5          12
5	6	  2         ...       6          6
6	7	  1         ...      -5          0
7	8	  2         ...      -5          39
8	9	  1         ...       4          0
9	10	  1         ...       6          0
[10 rows x 42 columns]

Set up dictionaries for value labels and column headings.

Load a dictionary that maps columns to the value labels and create a list of preferred column names as follows:

with open('data/nlscodes.txt', 'r') as reader:
...     setvalues = eval(reader.read())
...
pprint.pprint(setvalues)

{'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'},
 'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'},
 'S8646900': {1.0: '1. Definitely',
              2.0: '2. Probably ',
              3.0: '3. Probably not',
              4.0: '4. Definitely not'}}
...abbreviated to save space
newcols = ['personid','gender','birthmonth',
...   'birthyear','sampletype','category',
...   'satverbal','satmath','gpaoverall',
...   'gpaeng','gpamath','gpascience','govjobs',
...   'govprices','govhealth','goveld','govind',
...   'govunemp','govinc','govcollege',
...   'govhousing','govenvironment','bacredits',
...   'coltype1','coltype2','coltype3','coltype4',
...   'coltype5','coltype6','highestgrade',
...   'maritalstatus','childnumhome','childnumaway',
...   'degreecol1','degreecol2','degreecol3',
...   'degreecol4','wageincome','weeklyhrscomputer',
...   'weeklyhrstv','nightlyhrssleep',
...   'weeksworkedlastyear']

Set value labels and missing values, and change selected columns to the category data type.

Use the setvalues dictionary to replace existing values with value labels. Replace all values from –9 to –1 with NaN:

nls97r.replace(setvalues, inplace=True)
nls97r.head()

     R0000100   R0536300  ...     U2963000  Z9063900
0    1          Female    ...     -5        52
1    2          Male      ...     6         0
2    3          Female    ...     6         0
3    4          Female    ...     6         4
4    5          Male      ...     5         12
[5 rows x 42 columns]

nls97r.replace(list(range(-9,0)), np.nan, inplace=True)
for col in nls97r[[k for k in setvalues]].columns:
...     nls97r[col] = nls97r[col].astype('category')
...
nls97r.dtypes

R0000100	int64
R0536300	category
R0536401	int64
R0536402	int64
R1235800	category
              ... 
U2857300	category
U2962800	category
U2962900	category
U2963000	float64
Z9063900	float64
Length: 42, dtype: object

Set meaningful column headings:

nls97r.columns = newcols
nls97r.dtypes

personid	int64
gender	category
birthmonth	int64
birthyear	int64
sampletype	category
                         ... 
wageincome	category
weeklyhrscomputer	category
weeklyhrstv	category
nightlyhrssleep	float64
weeksworkedlastyear	float64
Length: 42, dtype: object

This shows how R data files can be imported into pandas and value labels assigned.

How it works…

Reading R data into pandas with pyreadr is fairly straightforward. Passing a filename to the read_r function is all that is required. Since read_r can return multiple objects with one call, we need to specify which object. When reading an rds file (as opposed to an rdata file), only one object is returned. It has the key None.

In Step 3, we loaded a dictionary that maps our variables to value labels, and a list for our preferred column headings. In Step 4 we applied the value labels. We also changed the data type to category for the columns where we applied the values. We did this by generating a list of the keys in our setvalues dictionary with [k for k in setvalues] and then iterating over those columns.

We change the column headings in Step 5 to ones that are more intuitive. Note that the order matters here. We need to set the value labels before changing the column names, since the setvalues dictionary is based on the original column headings.

The main advantage of using pyreadr to read R files directly into pandas is that we do not have to convert the R data into a CSV file first. Once we have written our Python code to read the file, we can just rerun it whenever the R data changes. This is particularly helpful when we do not have R on the machine where we work.

There’s more…

Pyreadr is able to return multiple DataFrames. This is useful when we save several data objects in R as an rdata file. We can return all of them with one call.

Pprint is a handy tool for improving the display of Python dictionaries.

We could have used rpy2 instead of pyreadr to import R data. rpy2 requires that R also be installed, but it is more powerful than pyreadr. It will read R factors and automatically set them to pandas DataFrame values. See the following code:

import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
readRDS = robjects.r['readRDS']
nls97withvalues = readRDS('data/nls97withvalues.rds')
nls97withvalues

          R0000100      R0536300     ...    U2963000         Z9063900
1         1             Female       ...    -2147483648      52
2         2             Male         ...    6                0
3         3             Female       ...    6                0
4         4             Female       ...    6                4
5         5             Male         ...    5                12
...       ...           ...          ...    ...             ...
8980     9018           Female       ...    4                49
8981     9019           Male         ...    6                0
8982     9020           Male         ...    -2147483648      15
8983     9021           Male         ...    7                50
8984     9022           Female       ...    7                20
[8984 rows x 42 columns]

This generates unusual –2147483648 values. This is what happened when readRDS interpreted missing data in numeric columns. A global replacement of that number with NaN, after confirming that that is not a valid value, would be a good next step.