ggplot2 and the grammar of graphics
The ggplot2 package was developed by Hadley Wickham and implements a completely different approach to statistical plots. Like lattice, this package is built on grid and provides a series of high-level functions that create complete plots. The ggplot2 package is an interpretation and extension of the principles of the book The Grammar of Graphics by Leland Wilkinson. Briefly, the grammar of graphics treats a statistical graphic as a mapping from data to the aesthetic attributes (such as color, shape, and size) of the geometric objects used to represent the data (such as points, lines, and bars). Besides the aesthetic attributes, a plot can also contain statistical transformations of the data. As in lattice, in ggplot2 we can split the data by a certain variable and draw each subset in an independent subplot; in ggplot2, this is called faceting.
In a more formal way, the main components of the grammar of graphics are the data and its mapping, aesthetics, geometric objects, statistical transformations, scales, coordinates, and faceting. We will cover each one of these elements in more detail in Chapter 3, The Layers and Grammar of Graphics, but for now, consider these general principles:
- The data that must be visualized is mapped to aesthetic attributes, which define how the data should be perceived
- Geometric objects describe what is actually displayed on the plot, such as lines, points, or bars; the geometric objects basically define which kind of plot you are going to draw
- Statistical transformations are applied to the data to summarize them; examples are the smooth or regression lines of the previous examples and the binning used for histograms
- Scales map values in the data space to values in the aesthetic space, such as a color, a size, or a position; scales are also used to draw legends and axes
- Coordinates represent the coordinate system in which the data is drawn
- Faceting, which we have already mentioned, splits the data into subsets defined by the values of a variable and draws each subset in its own subplot
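As a rough illustration of how these components can appear in code, the following sketch builds a single plot from the Orange dataset and labels each component in a comment. The specific choices made here (a log-scaled y axis, Cartesian coordinates, and one facet per tree) are assumptions picked only to show one example of each component; they are not taken from the examples in this chapter.

```r
library(ggplot2)   # assumes the ggplot2 package is installed

p <- ggplot(data = Orange,                       # data
            aes(x = circumference, y = age)) +   # aesthetic mapping
  geom_point() +                                 # geometric object
  stat_smooth(method = "lm", se = FALSE) +       # statistical transformation
  scale_y_log10() +                              # scale
  coord_cartesian() +                            # coordinate system
  facet_wrap(~ Tree)                             # faceting

print(p)   # draw the plot
```

Removing or swapping any single line changes only the corresponding component, which is the practical benefit of describing a plot through a grammar.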
In ggplot2, two main high-level functions can create a plot directly: qplot() and ggplot(). The name qplot() stands for quick plot; it is a simple function that serves a purpose similar to that of the plot() function in graphics. The ggplot() function, on the other hand, is a much more advanced function that gives the user more control over the plot layout and details. In our journey into the world of ggplot2, we will see some examples of qplot(), in particular when we go through the different kinds of graphs, but we will dig a lot deeper into ggplot(), since it is better suited to advanced examples.
If you look at the various R programming forums, you will find quite a bit of discussion about which of these two functions is more convenient to use. My general recommendation is that it depends on the type of graph you draw most frequently. For simple, standard plots, where only the data needs to be represented and only minor modifications of the standard layout are required, the qplot() function will do the job. On the other hand, if you need to apply particular transformations to the data, or if you simply want the freedom to control and define the details of the plot layout, I would recommend that you focus on ggplot(). As you will see, the code for the two functions is not completely different, since both are based on the same underlying philosophy, but the way in which the options are set is quite different, so if you want to adapt a plot from one function to the other, you will essentially need to rewrite your code. If you want to learn only one of them, I would definitely recommend ggplot().
The following code shows an example of a plot created with ggplot2, in which you can identify some of the components of the grammar of graphics. The example uses the ggplot() function, which allows a more direct comparison with the grammar of graphics; the corresponding qplot() code follows it. Both pieces of code generate the graph depicted in Figure 1.7:
library(ggplot2)                             ## Load ggplot2
data(Orange)                                 ## Load the data

ggplot(data = Orange,                        ## Data used
       aes(x = circumference, y = age,
           color = Tree)) +                  ## Aesthetic
  geom_point() +                             ## Geometry
  stat_smooth(method = "lm", se = FALSE)     ## Statistics

### Corresponding code with qplot()
### (note: qplot() is deprecated in ggplot2 3.4.0 and later)
qplot(circumference, age, data = Orange,     ## Data used
      color = Tree,                          ## Aesthetic mapping
      geom = c("point", "smooth"),
      method = "lm", se = FALSE)
This simple example gives you an idea of the role of each portion of code in a ggplot2 graph. The main function body creates the connection between the data and the aesthetics we want to represent, and on top of this you add the components of the plot; in this case, we added the geometry element of points and the statistical element of regression. Notice also that the components added to the main function call are joined with the + sign. One more thing worth mentioning at this point is that if you run just the main body, that is, the ggplot() call on its own, no data is drawn: in the ggplot2 version this example was written for, you get an error message, while more recent versions draw only an empty panel. This is because the call on its own does not describe an actual plot. The plot is actually created when you add the geometric object, which in this case is geom_point(). This is perfectly in line with the grammar of graphics since, as we have seen, the geometry represents the actual connection between the data and what is displayed on the plot. This is the stage where we specify that the data should be represented as points; before that, nothing was specified about which kind of plot we wanted to draw.
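This layered construction can be checked directly, because a ggplot object keeps its layers in a list. The following is a minimal sketch: the variable names base, with_points, and full are hypothetical, and reading the layers element relies on the internal structure of the plot object, which is assumed here and may differ between ggplot2 versions.

```r
library(ggplot2)   # assumes the ggplot2 package is installed

# Main function body only: data and aesthetic mapping, no geometry yet
base <- ggplot(data = Orange,
               aes(x = circumference, y = age, color = Tree))
n_base <- length(base$layers)          # no layers have been added

# Adding a geometry with + creates the first layer
with_points <- base + geom_point()
n_points <- length(with_points$layers)

# Adding the regression lines creates a second layer
full <- with_points + stat_smooth(method = "lm", se = FALSE)
n_full <- length(full$layers)

c(n_base, n_points, n_full)            # layer counts at each step
```

Each use of + returns a new plot object rather than modifying the previous one, so base can be reused as the starting point for several different plots.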