One of my coding hobbies is to explore different Python packages and libraries. In this post, I'll talk about the package rpy2, which is used to call R from inside Python.
Being an avid user of R and a huge supporter of R graphical packages, I had always desired to call R inside my Python code to be able to produce beautiful visualizations. The R framework offers machinery for a variety of statistical and data mining tasks. Let's review the basics of R before we delve into R-Python interfacing.
R is a free, open-source statistical language with comprehensive support for various statistical, data mining, and visualization tasks. Quick-R describes it as:
"R is an elegant and comprehensive statistical and graphical programming language."
R is one of the fastest growing languages, mainly due to the surge in interest in statistical learning and data science. The Data Science Specialization on Coursera has all courses taught in R. There are R packages for machine learning, graphics, text mining, bioinformatics, topic modeling, interactive visualizations, markdown, and many others.
In this post, I'll give a quick introduction to R. The motivation is to acquire some knowledge of R to be able to follow the discussion on R-Python interfacing.
R can be downloaded from one of the Comprehensive R Archive Network (CRAN) mirror sites. Once it is installed, an R script can be run from the command line with Rscript:
Rscript file.r
The most fundamental data structure in R is a vector; actually everything in R is a vector (even a single number is a vector of length 1). This is one of the strangest things about R. A vector contains elements of the same type and is created with the c() function.
a = c(1,2,5,9,11)
a
[1] 1 2 5 9 11
strings = c("aa", "apple", "beta", "down")
strings
[1] "aa" "apple" "beta" "down"
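For readers who think in Python, the same-type rule has a rough analogue in the standard library's array module, sketched below. The comparison is only an analogy made for this sketch, not how R or rpy2 represents vectors: R's c() coerces mixed elements to a common type, whereas a typed Python array raises an error.

```python
from array import array

# A typed array of doubles, loosely comparable to the numeric R vector a.
a = array("d", [1, 2, 5, 9, 11])
print(a.typecode, list(a))

# Unlike a Python list, a typed array rejects elements of another type.
try:
    a.append("apple")
    rejected = False
except TypeError:
    rejected = True
print(rejected)
```

A plain Python list would have accepted the string, which is why a list is a looser match for an R vector than a typed array.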
The elements in a vector are indexed, but the indexing starts at 1, not at 0 as in most major languages (for example, Python).
strings[1]
[1] "aa"
The fact that everything in R is a vector and that the indexing starts at 1 are the main reasons for people's initial frustration with R (I forget this all the time).
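The off-by-one difference can be made concrete with a small Python sketch. The helper r_index is hypothetical and exists only for illustration; note that it raises an error for an out-of-range index, whereas R itself would return NA.

```python
# The same strings as the R vector above, stored in a Python list.
strings = ["aa", "apple", "beta", "down"]


def r_index(seq, i):
    """Return the element R would call seq[i], using 1-based counting.

    Hypothetical helper for illustration only; it raises IndexError
    for out-of-range positions instead of returning NA as R does.
    """
    if not 1 <= i <= len(seq):
        raise IndexError("R-style index out of range: %d" % i)
    return seq[i - 1]


first_python = strings[0]            # Python counts from 0
first_r_style = r_index(strings, 1)  # R counts from 1
print(first_python, first_r_style)
```

Both lookups return the same element, "aa"; only the position used to reach it differs.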
Many R packages expect data as a data frame, which is essentially a matrix whose columns can be accessed by name and can be of different types. Data frames are useful outside of R as well: the Python package pandas was written primarily to implement data frames and to do analysis on them.
In R, data frames are created (from vectors) as follows:
students = c("Anne", "Bret", "Carl", "Daron", "Emily")
scores = c(7,3,4,9,8)
grades = c('B', 'D', 'C', 'A', 'A')
results = data.frame(students, scores, grades)
results
  students scores grades
1     Anne      7      B
2     Bret      3      D
3     Carl      4      C
4    Daron      9      A
5    Emily      8      A
A column of a data frame can be accessed by name with the $ operator:
results$students
[1] Anne Bret Carl Daron Emily
Levels: Anne Bret Carl Daron Emily
This gives a vector (here a factor, as the Levels line shows), whose elements can be accessed by indexing.
results$students[1]
[1] Anne
Levels: Anne Bret Carl Daron Emily
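As a rough mental model for Python readers, a data frame can be thought of as a set of equal-length named columns. The sketch below mimics that with a plain dictionary of lists; this is an illustrative analogy only (it is neither pandas nor rpy2), and the variable names simply mirror the R example above.

```python
students = ["Anne", "Bret", "Carl", "Daron", "Emily"]
scores = [7, 3, 4, 9, 8]
grades = ["B", "D", "C", "A", "A"]

# Named columns of different types, all of the same length.
results = {"students": students, "scores": scores, "grades": grades}
column_lengths = {len(column) for column in results.values()}

# Column access by name, then 0-based indexing (results$students[1] in R).
first_student = results["students"][0]

# Rows are recovered by zipping the columns together.
rows = list(zip(*results.values()))
print(first_student, rows[3])
```

Storing data column by column is what makes name-based access such as results$students cheap and natural.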
Most of the time, data is given as a comma-separated values (csv) file or a tab-separated values (tsv) file. We will see how to read a csv/tsv file in R and create a data frame from it.
(Aside: The datasets in most Kaggle competitions are given as csv files and we are required to do machine learning on them. In Python, one creates a pandas data frame or a NumPy array from this csv file.)
In R, we use the read.csv or read.table function to load a csv file into memory, for example, for the Titanic competition on Kaggle:
# Read the full training set; the first row holds the column names.
train_all <- read.csv("train.csv", header=TRUE)
# Keep only the columns of interest in a smaller data frame.
train <- data.frame(survived=train_all$Survived,
                    age=train_all$Age,
                    fare=train_all$Fare,
                    pclass=train_all$Pclass)
Similarly, a tsv file can be loaded as:
data <- read.csv("file.tsv", header=TRUE, sep="\t")
Thus given a csv/tsv file with or without headers, we can read it using the read.csv function and create a data frame using:
data.frame(vector_1, vector_2, ... vector_n).
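For comparison with the Python side mentioned in the aside, here is a minimal sketch of reading tab-separated text into named columns using only the standard library. The sample text stands in for a file such as file.tsv and is made up for illustration, as is the helper name read_columns; in practice one would more likely use pandas.

```python
import csv
import io

# Hypothetical tab-separated content standing in for a file like file.tsv.
tsv_text = "students\tscores\tgrades\nAnne\t7\tB\nBret\t3\tD\nCarl\t4\tC\n"


def read_columns(handle, delimiter="\t"):
    """Read delimited text with a header row into a dict of column lists."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


columns = read_columns(io.StringIO(tsv_text))
# The csv module returns strings, so numeric columns are converted explicitly.
scores = [int(value) for value in columns["scores"]]
print(columns["students"], scores)
```

Unlike read.csv in R, the csv module does not infer column types, which is why the scores are converted by hand.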
This should be enough to start exploring R packages. Another very useful command in R is head(), which prints the first few rows of a data frame, much like the head command on Unix.
First things first: to call R from Python, we need to have both Python and R installed. Then install rpy2 from the Python Package Index (PyPI). To do this, simply type the following on the command line:
pip install rpy2
We will use the high-level interface to R, the robjects subpackage of rpy2.
import rpy2.robjects as ro
We can pass commands to the R session by putting the R commands in the ro.r() method as strings. Recall that everything in R is a vector. Let's create a vector using robjects:
ro.r('x=c(2,4,6,8)')
print(ro.r('x'))
[1] 2 4 6 8
Keep in mind that though x is an R object (vector), ro.r('x') is a Python object (rpy2 object). This can be checked as follows:
type(ro.r('x'))
<class 'rpy2.robjects.vectors.FloatVector'>
The most important data types in R are data frames, which are essentially matrices. We can create a data frame using rpy2:
ro.r('x=c(2,4,6,8)')
ro.r('y=c(4,8,12,16)')
ro.r('rdf=data.frame(x,y)')
This created an R data frame, rdf.
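If we want to manipulate this data frame using Python, we need to convert it to a Python object. We will convert the R data frame to a pandas data frame; the Python package pandas contains efficient implementations of data frame objects. Note that the pandas.rpy module used below was deprecated in pandas 0.16 and removed in later releases, so this code needs an older pandas; newer setups use the rpy2.robjects.pandas2ri module for the same conversion.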
import pandas.rpy.common as com
df = com.load_data('rdf')
print(type(df))
<class 'pandas.core.frame.DataFrame'>
df.x = 2*df.x
Here we have doubled each of the elements of the x vector in the data frame df. But df is a Python object, which we can convert back to an R data frame using pandas as:
rdf = com.convert_to_r_dataframe(df)
print(type(rdf))
<class 'rpy2.robjects.vectors.DataFrame'>
Let's use the plotting machinery of R, which is the main purpose of studying rpy2:
ro.r('plot(x,y)')
rpy2 lets us work not only with R data types but also with R packages (provided they are installed in R), which we can import and use for analysis. Here we will build a linear model on x and y using the R package stats:
from rpy2.robjects.packages import importr
stats = importr('stats')
base = importr('base')
fit = stats.lm('y ~ x', data=rdf)
print(base.summary(fit))
We get the following results (the estimates shown correspond to the original x values, before x was doubled in pandas):
Residuals:
1 2 3 4
0 0 0 0
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        0          0      NA       NA
x                  2          0     Inf   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0 on 2 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: Inf on 1 and 2 DF, p-value: < 2.2e-16
R programmers will immediately recognize the output as coming from applying the linear model function lm() to the data.
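For readers less familiar with lm(), the estimates in the summary can be checked by hand. The sketch below computes the ordinary least-squares intercept and slope in plain Python for the original vectors x = (2, 4, 6, 8) and y = (4, 8, 12, 16), which are the values the printed slope of 2 corresponds to. The function name simple_ols is hypothetical, and this is a textbook closed-form calculation, not what R does internally.

```python
x = [2.0, 4.0, 6.0, 8.0]
y = [4.0, 8.0, 12.0, 16.0]


def simple_ols(xs, ys):
    """Closed-form least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = simple_ols(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print(intercept, slope, residuals)
```

Because y is exactly 2x, the fit is perfect: the intercept is 0, the slope is 2, and every residual is 0, which is also why the summary reports a zero standard error and an infinite t value.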
I'll end this discussion with an example using my favorite R package ggplot2. I have written a lot of posts on data visualization using ggplot2. The following example is borrowed from the official documentation of rpy2.
import math, datetime
import rpy2.robjects.lib.ggplot2 as ggplot2
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
base = importr('base')
datasets = importr('datasets')
mtcars = datasets.data.fetch('mtcars')['mtcars']
# Parentheses let the expression continue across lines.
pp = (ggplot2.ggplot(mtcars) +
      ggplot2.aes_string(x='wt', y='mpg', col='factor(cyl)') +
      ggplot2.geom_point() +
      ggplot2.geom_smooth(ggplot2.aes_string(group='cyl'),
                          method='lm'))
pp.plot()
Janu Verma is a researcher in the IBM T.J. Watson Research Center, New York. His research interests are in mathematics, machine learning, information visualization, computational biology, and healthcare analytics. He has held research positions at Cornell University, Kansas State University, Tata Institute of Fundamental Research, Indian Institute of Science, and the Indian Statistical Institute. He has written papers for IEEE Vis, KDD, International Conference on HealthCare Informatics, Computer Graphics and Applications, Nature Genetics, IEEE Sensors Journals and so on. His current focus is on the development of visual analytics systems for prediction and understanding. He advises start-ups and other companies on data science and machine learning in the Delhi-NCR area.