In this post, we will learn about data visualization using ggplot2, an R package for data exploration and visualization. It produces clean graphics that are easy to interpret. ggplot2 is used mainly in exploratory analysis, and it is an important element of a data scientist’s toolkit. The ease with which complex graphs can be plotted is probably its most attractive feature, and it also lets you slice and dice data in many different ways. ggplot2 is Hadley Wickham's implementation of the layered grammar of graphics, described in his paper A Layered Grammar of Graphics. Wickham is one of the best-known R developers.
Installing packages in R is very easy. Just type the following command at the R prompt.
install.packages("ggplot2")
Then load the package in your R session or script.
library(ggplot2)
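We will start with the function qplot(), the simplest plotting function in ggplot2. It is very similar to the generic plot() function in base R. We will learn how to draw basic statistical and exploratory plots using qplot(). Note that qplot() is deprecated as of ggplot2 3.4.0, so recent versions may print a warning; the ggplot() function covered later in this post is the recommended interface.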
We will use the iris dataset, which ships with R in the datasets package (and with every other data mining package that I know of). It records measurements of three species of iris. In R, iris is a data frame of 150 rows and 5 columns: four measurements (sepal length and width, petal length and width) and the species.
The head() function prints the first six rows of the data.
head(iris)
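Before plotting, it can help to confirm the shape and column types described above. The following is a minimal sketch that uses only base R functions; nothing in it depends on ggplot2.

```r
# Inspect the built-in iris data frame (ships with R in the datasets package)
dim(iris)              # number of rows and columns
str(iris)              # column names and types
levels(iris$Species)   # the species labels
table(iris$Species)    # number of observations per species
```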
The general syntax for the qplot function is:
qplot(x,y, data=data.frame)
# Scatterplot of petal length against sepal length
qplot(Sepal.Length, Petal.Length, data = iris)
# The same plot, with points colored by species
qplot(Sepal.Length, Petal.Length, data = iris, color = Species)
An observant reader would notice that this coloring scheme provides a way to visualize clustering.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width)
Position, color, and size now each encode a variable, so we have a visualization of four-dimensional data.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species,
      size = Petal.Width, alpha = I(0.7))
Making the points semi-transparent with alpha reduces the over-plotting of the data.
# Custom axis labels and a main title
qplot(Sepal.Length, Petal.Length, data = iris, color = Species,
      xlab = "Sepal Length", ylab = "Petal Length",
      main = "Sepal vs. Petal Length in Fisher's Iris data")
All the above graphs were scatterplots. We can use the geom argument to draw other types of graphs.
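# Bar chart of the number of observations at each sepal length
qplot(Sepal.Length, data = iris, geom = "bar")
# Line chart with one line per species
qplot(Sepal.Length, Petal.Length, data = iris, geom = "line",
      color = Species)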
Now we'll move to the ggplot() function, which has a much broader range of graphing techniques. We'll start with the basic plots similar to what we did with qplot().
First things first, load the library:
library(ggplot2)
As before, we will use the iris dataset.
For ggplot(), we specify aesthetic mappings, which describe how variables in the data are mapped to visual properties such as position, color, and size. These are specified with the aes() function.
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()
This is exactly what we got for qplot().
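One point that often causes confusion is the difference between mapping a variable to an aesthetic inside aes() and setting an aesthetic to a fixed value outside it. The sketch below illustrates the distinction; the object names p_mapped and p_fixed and the color "steelblue" are our own choices, not part of the original examples.

```r
library(ggplot2)

# Mapped: color varies with the Species column, and a legend is drawn
p_mapped <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(color = Species))

# Set: every point is drawn in the same fixed color, with no legend
p_fixed <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(color = "steelblue")

print(p_mapped)
print(p_fixed)
```

This is also why the earlier qplot() call wrapped 0.7 in I(): qplot() treats its arguments as mappings, and I() marks a value to be used as is.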
The syntax is a bit unintuitive, but is very consistent. The basic structure is:
ggplot(data.frame, aes(x=, y=, ...)) + geom_*(.) + ....
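Because a plot is built by adding pieces with +, a partially built plot can be stored in a variable and extended later. The following sketch assumes nothing beyond ggplot2 itself; the variable names base_plot, scatter, and titled are illustrative.

```r
library(ggplot2)

# Data and aesthetic mappings only; no layers have been added yet
base_plot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length))

# Add a point layer, then a title, reusing the same base object
scatter <- base_plot + geom_point()
titled  <- scatter + ggtitle("Sepal length vs. petal length")

print(titled)  # the plot is drawn when the object is printed
```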
# Color the points by species
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()
We can use other geoms to create different types of graphs, for example, a line chart:
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_line() + ggtitle("Plot of sepal length vs. petal length")
# Histogram of sepal length with bins 0.2 wide
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.2)
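Use the fill aesthetic to color the bars by species. By default the bars for each species are stacked; position = "dodge" places them side by side instead.
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2)
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2, position = "dodge")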
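Besides the default stacking and "dodge", overlapping histograms can be drawn with position = "identity" and some transparency. This is a sketch of one alternative, not the only way to compare distributions; the alpha value of 0.5 and the name overlaid are arbitrary choices of ours.

```r
library(ggplot2)

# Overlay the three species' histograms instead of stacking or dodging them;
# alpha makes the overlapping bars visible through one another
overlaid <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2, position = "identity", alpha = 0.5)

print(overlaid)
```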
# Scatterplot colored by species, with a title
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() + ggtitle("Plot of sepal length vs. petal length")
# Point size encodes petal width; alpha = 0.7 makes the points semi-transparent
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species, size = Petal.Width)) +
  geom_point(alpha = 0.7) + ggtitle("Plot of sepal length vs. petal length")
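The qplot() examples earlier set axis labels with xlab and ylab. With ggplot(), one way to do the same is the labs() function, sketched below. The label text is reused from the qplot() example, and the name labelled is our own.

```r
library(ggplot2)

# labs() sets the axis labels and the title in a single call
labelled <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  labs(
    x = "Sepal Length",
    y = "Petal Length",
    title = "Sepal vs. Petal Length in Fisher's Iris data"
  )

print(labelled)
```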
# Variables can be transformed directly inside aes()
ggplot(iris, aes(x = log(Sepal.Length), y = Petal.Length / Petal.Width, color = Species)) +
  geom_point()
# facet_wrap() draws one panel per species
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() + facet_wrap(~Species) +
  ggtitle("Plot of sepal length vs. petal length")
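By default every panel of a faceted plot shares the same axes. If each panel should instead zoom to its own range, facet_wrap() accepts a scales argument. The sketch below uses scales = "free", which is a choice on our part and makes the panels harder to compare directly.

```r
library(ggplot2)

# Each species gets its own panel with independently scaled x and y axes
faceted <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  facet_wrap(~Species, scales = "free")

print(faceted)
```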
We can use a whole range of themes for ggplot using the R package ggthemes.
install.packages('ggthemes', dependencies = TRUE)
library(ggthemes)
Essentially, you add a theme_*() call to the plot with +, just as you would add a geom.
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() + theme_economist()
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() + theme_fivethirtyeight()
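ggplot2 also ships with a few themes of its own, so a different look does not strictly require ggthemes. Here is a brief sketch using the built-in theme_minimal() together with a theme() tweak; the legend placement and the name themed are just examples.

```r
library(ggplot2)

themed <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  theme_minimal() +                   # built-in theme, no extra package needed
  theme(legend.position = "bottom")   # move the legend below the plot

print(themed)
```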
Janu Verma is a researcher in the IBM T.J. Watson Research Center, New York. His research interests are mathematics, machine learning, information visualization, computational biology and healthcare analytics. He has held research positions at Cornell University, Kansas State University, Tata Institute of Fundamental Research, Indian Institute of Science, and Indian Statistical Institute. He has written papers for IEEE Vis, KDD, International Conference on HealthCare Informatics, Computer Graphics and Applications, Nature Genetics, IEEE Sensors Journals, and so on. His current focus is on the development of visual analytics systems for prediction and understanding. He advises start-ups and companies on data science and machine learning in the Delhi-NCR area.