Importing R data
We will use pyreadr to read an R data file into pandas. Since pyreadr cannot capture the metadata, we will write code to reconstruct value labels (analogous to R factors) and column headings. This is similar to what we did in the Importing data from SQL databases recipe.
The R statistical package is, in many ways, similar to the combination of Python and pandas, at least in its scope. Both have strong tools across a range of data preparation and data analysis tasks. Some data scientists work with both R and Python, perhaps doing data manipulation in Python and statistical analysis in R, or vice versa, depending on their preferred packages. However, there is currently a scarcity of tools for reading data saved in R, as rds or rdata files, into Python. The analyst often saves the data as a CSV file first and then loads it into Python. We will use pyreadr, from the same author as pyreadstat, because it does not require an installation of R.
When we receive an R file, or work with one we have created ourselves, we can count on it being fairly well structured, at least compared to CSV or Excel files. Each column will have only one data type, column headings will have appropriate names for Python variables, and all rows will have the same structure. However, we may need to restore some of the coding logic, as we did when working with SQL data.
Getting ready
This recipe assumes you have installed the pyreadr package. If it is not installed, you can install it with pip. From the Terminal, or PowerShell (in Windows), enter pip install pyreadr.
We will again work with the NLS data in this recipe. You will need to download the rds file used in this recipe from the GitHub repository in order to run the code.
How to do it…
We will import data from R without losing important metadata:
- Load pandas, numpy, pprint, and the pyreadr package:
import pandas as pd
import numpy as np
import pyreadr
import pprint
- Get the R data.
Pass the path and filename to the read_r function to retrieve the R data, and load it into memory as a pandas DataFrame. read_r can return one or more objects. When reading an rds file (as opposed to an rdata file), it will return one object, having the key None. We index with None to get the pandas DataFrame:
nls97r = pyreadr.read_r('data/nls97.rds')[None]
nls97r.dtypes
R0000100 int32
R0536300 int32
...
U2962800 int32
U2962900 int32
U2963000 int32
Z9063900 int32
dtype: object
nls97r.head(10)
R0000100 R0536300 ... U2963000 Z9063900
0 1 2 ... -5 52
1 2 1 ... 6 0
2 3 2 ... 6 0
3 4 2 ... 6 4
4 5 1 ... 5 12
5 6 2 ... 6 6
6 7 1 ... -5 0
7 8 2 ... -5 39
8 9 1 ... 4 0
9 10 1 ... 6 0
[10 rows x 42 columns]
- Set up dictionaries for value labels and column headings.
Load a dictionary that maps columns to the value labels and create a list of preferred column names as follows:
with open('data/nlscodes.txt', 'r') as reader:
    setvalues = eval(reader.read())

pprint.pprint(setvalues)
{'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'},
'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'},
'S8646900': {1.0: '1. Definitely',
2.0: '2. Probably ',
3.0: '3. Probably not',
4.0: '4. Definitely not'}}
...abbreviated to save space
newcols = ['personid','gender','birthmonth',
    'birthyear','sampletype','category',
    'satverbal','satmath','gpaoverall',
    'gpaeng','gpamath','gpascience','govjobs',
    'govprices','govhealth','goveld','govind',
    'govunemp','govinc','govcollege',
    'govhousing','govenvironment','bacredits',
    'coltype1','coltype2','coltype3','coltype4',
    'coltype5','coltype6','highestgrade',
    'maritalstatus','childnumhome','childnumaway',
    'degreecol1','degreecol2','degreecol3',
    'degreecol4','wageincome','weeklyhrscomputer',
    'weeklyhrstv','nightlyhrssleep',
    'weeksworkedlastyear']
- Set value labels and missing values, and change selected columns to the category data type.
Use the setvalues dictionary to replace existing values with value labels. Replace all values from -9 to -1 with NaN:
nls97r.replace(setvalues, inplace=True)
nls97r.head()
R0000100 R0536300 ... U2963000 Z9063900
0 1 Female ... -5 52
1 2 Male ... 6 0
2 3 Female ... 6 0
3 4 Female ... 6 4
4 5 Male ... 5 12
[5 rows x 42 columns]
nls97r.replace(list(range(-9,0)), np.nan, inplace=True)
for col in nls97r[[k for k in setvalues]].columns:
    nls97r[col] = nls97r[col].astype('category')

nls97r.dtypes
R0000100 int64
R0536300 category
R0536401 int64
R0536402 int64
R1235800 category
...
U2857300 category
U2962800 category
U2962900 category
U2963000 float64
Z9063900 float64
Length: 42, dtype: object
- Set meaningful column headings:
nls97r.columns = newcols
nls97r.dtypes
personid int64
gender category
birthmonth int64
birthyear int64
sampletype category
...
wageincome category
weeklyhrscomputer category
weeklyhrstv category
nightlyhrssleep float64
weeksworkedlastyear float64
Length: 42, dtype: object
This shows how R data files can be imported into pandas and value labels assigned.
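Step 3 reads the label dictionary with eval, which executes whatever the file contains. If the file holds nothing but a Python dictionary literal, as the printed output suggests, ast.literal_eval from the standard library is a more cautious way to parse it. The following is a minimal sketch under that assumption; the helper name load_value_labels and the temporary demo file are hypothetical and only stand in for data/nlscodes.txt.

```python
import ast
import os
import tempfile


def load_value_labels(path):
    """Parse a file holding one Python dict literal without executing code."""
    with open(path, "r") as reader:
        labels = ast.literal_eval(reader.read())
    if not isinstance(labels, dict):
        raise ValueError(f"expected a dict literal in {path}")
    return labels


# Demo with a throwaway file that stands in for data/nlscodes.txt.
sample = "{'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'}}"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    demo_path = tmp.name

setvalues_demo = load_value_labels(demo_path)
os.remove(demo_path)
print(setvalues_demo["R0536300"][1.0])
```

literal_eval accepts only literals such as dictionaries, lists, strings, and numbers, so a file containing function calls or imports raises an error instead of running.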
How it works…
Reading R data into pandas with pyreadr is fairly straightforward. Passing a filename to the read_r function is all that is required. Since read_r can return multiple objects with one call, we need to specify which object. When reading an rds file (as opposed to an rdata file), only one object is returned. It has the key None.
In Step 3, we loaded a dictionary that maps our variables to value labels, and a list for our preferred column headings. In Step 4, we applied the value labels. We also changed the data type to category for the columns where we applied the values. We did this by generating a list of the keys in our setvalues dictionary with [k for k in setvalues] and then iterating over those columns.
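The label, missing-value, and category steps can be tried on a small frame before running them on the full dataset. The sketch below uses a toy DataFrame and a one-column label dictionary; the names demo and demo_labels and the values in them are illustrative assumptions, not part of the NLS file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for nls97r; the codes and labels are illustrative only.
demo = pd.DataFrame({
    "R0000100": [1, 2, 3],
    "R0536300": [2, 1, 2],
    "U2963000": [-5, 6, 4],
})
demo_labels = {"R0536300": {0: "No Information", 1: "Male", 2: "Female"}}

# Nested dict: the outer key selects the column, the inner dict maps old -> new.
demo = demo.replace(demo_labels)

# The recipe treats codes -9 through -1 as missing, so turn them into NaN.
demo = demo.replace(list(range(-9, 0)), np.nan)

# Only the columns that received labels become categoricals.
for col in demo_labels:
    demo[col] = demo[col].astype("category")

print(demo.dtypes)
```

Iterating over the dictionary directly visits its keys, so it selects the same columns as the [k for k in setvalues] list used in the recipe.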
We change the column headings in Step 5 to ones that are more intuitive. Note that the order matters here. We need to set the value labels before changing the column names, since the setvalues dictionary is based on the original column headings.
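Assigning a list to columns pairs names purely by position. One possible variation, sketched here on a toy frame, is to build an explicit mapping from old to new names so the pairing can be inspected before it is applied. The names demo, demo_newcols, and colmap are hypothetical.

```python
import pandas as pd

# Toy frame whose labels were already applied under the original headings.
demo = pd.DataFrame({"R0000100": [1, 2], "R0536300": ["Female", "Male"]})
demo_newcols = ["personid", "gender"]

# zip() stops at the shorter input without complaint, so check lengths first.
if len(demo_newcols) != len(demo.columns):
    raise ValueError("need exactly one new name per column")

# An explicit old -> new mapping makes the pairing visible and easy to review.
colmap = dict(zip(demo.columns, demo_newcols))
demo = demo.rename(columns=colmap)
print(colmap)
```

Keeping colmap around also gives a record of which original variable each friendly name came from.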
The main advantage of using pyreadr to read R files directly into pandas is that we do not have to convert the R data into a CSV file first. Once we have written our Python code to read the file, we can just rerun it whenever the R data changes. This is particularly helpful when we do not have R on the machine where we work.
There’s more…
Pyreadr is able to return multiple DataFrames. This is useful when we save several data objects in R as an rdata file. We can return all of them with one call.
Pprint is a handy tool for improving the display of Python dictionaries.
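As a small illustration of the difference pprint makes, the sketch below prints a two-variable label dictionary both ways. The dictionary is a shortened copy of the labels shown earlier, and the width of 60 is an arbitrary choice for the demo.

```python
import pprint

demo_labels = {
    "R0536300": {0.0: "No Information", 1.0: "Male", 2.0: "Female"},
    "R1235800": {0.0: "Oversample", 1.0: "Cross-sectional"},
}

# print() puts the whole dictionary on one long line.
print(demo_labels)

# pprint wraps the output at the requested width (80 characters by default).
pprint.pprint(demo_labels, width=60)

# pformat returns the same text as a string instead of printing it.
formatted = pprint.pformat(demo_labels, width=60)
```

pformat is convenient when the formatted text needs to go to a log or a file rather than the console.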
We could have used rpy2 instead of pyreadr to import R data. rpy2 requires that R also be installed, but it is more powerful than pyreadr. It will read R factors and automatically convert them to their labels in the pandas DataFrame. See the following code:
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
readRDS = robjects.r['readRDS']
nls97withvalues = readRDS('data/nls97withvalues.rds')
nls97withvalues
R0000100 R0536300 ... U2963000 Z9063900
1 1 Female ... -2147483648 52
2 2 Male ... 6 0
3 3 Female ... 6 0
4 4 Female ... 6 4
5 5 Male ... 5 12
... ... ... ... ... ...
8980 9018 Female ... 4 49
8981 9019 Male ... 6 0
8982 9020 Male ... -2147483648 15
8983 9021 Male ... 7 50
8984 9022 Female ... 7 20
[8984 rows x 42 columns]
This generates unusual -2147483648 values. They appear where readRDS encountered missing data in integer columns: -2147483648 is the smallest 32-bit integer, which R uses internally to represent NA in integer vectors. A global replacement of that number with NaN, after confirming that it is not a valid value, would be a good next step.
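A minimal sketch of that cleanup follows, using a toy DataFrame in place of nls97withvalues so that it runs without R or rpy2. The names demo, R_NA_INTEGER, sentinel_counts, and cleaned are hypothetical.

```python
import numpy as np
import pandas as pd

# R stores NA in integer vectors as the smallest 32-bit integer.
R_NA_INTEGER = np.iinfo(np.int32).min  # -2147483648

# Toy stand-in for two columns of nls97withvalues.
demo = pd.DataFrame({
    "U2963000": [R_NA_INTEGER, 6, 6, R_NA_INTEGER],
    "Z9063900": [52, 0, 0, 15],
})

# Count the sentinel per column before replacing it, as a sanity check.
sentinel_counts = (demo == R_NA_INTEGER).sum()
print(sentinel_counts)

# Replace the sentinel everywhere; columns that contained it become float64.
cleaned = demo.replace(R_NA_INTEGER, np.nan)
print(cleaned.dtypes)
```

Reviewing the per-column counts first helps confirm that the value only shows up where missing data is expected.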
See also
Clear instructions and examples for pyreadr are available at https://github.com/ofajardo/pyreadr.
Feather files, a relatively new format, can be read by both R and Python. I discuss those files in the next recipe.