Bar charts
Bar charts are usually used to explore how one (or more) categorical variables are distributed. In qplot(), this is done using the geom option bar. This geometry counts the number of occurrences of each level of a factor variable that appears in the data. To show an example of the bar chart, we will use the movies dataset, which is included within the ggplot2 package. We have already seen how to recall the datasets included with the basic installation of R, but if you are interested in the list of datasets within a specific package (ggplot2 in this case), you can use the following code:
require(ggplot2)          ## Load ggplot2 if needed
data(package="ggplot2")   ## List of datasets within ggplot2
The movies dataset contains information about movies, including their ratings, from the http://imdb.com/ website. You can get a more detailed description in the help page of the dataset.
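One caveat, stated as an assumption about your installation: from ggplot2 2.0.0 onward, the movies dataset is distributed in the separate ggplot2movies package rather than in ggplot2 itself. The following sketch wraps the listing call in a small helper (list_datasets is a name made up for this example) and loads movies from whichever of the two packages provides it:

```r
# Hypothetical helper: names of the datasets shipped with an installed package.
list_datasets <- function(pkg) {
  # data(package = ...) returns an object whose `results` matrix has an
  # "Item" column holding the dataset names
  unname(data(package = pkg)$results[, "Item"])
}

# Assumption: `movies` lives either in ggplot2 (older releases) or in
# ggplot2movies (newer releases); try both and stop at the first match.
for (pkg in c("ggplot2", "ggplot2movies")) {
  if (requireNamespace(pkg, quietly = TRUE) &&
      "movies" %in% list_datasets(pkg)) {
    data("movies", package = pkg)
    break
  }
}
```

If neither package provides the dataset, the loop simply finishes without creating the movies object, so it is worth checking exists("movies") before moving on.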
This dataset contains many variables but, for our example, we will not need all of them, so let's rearrange its content a little. For our exercise, we are first interested in knowing how many movies were produced in each category: Action, Animation, Comedy, Drama, Documentary, and Romance. Let's also keep in the dataset the information about the movie budget, whether it was a short or regular movie, and its year. So, the steps covered in our code are:
- Load the data.
- Extract from the dataset the budget, the short-movie flag, and the year for each movie type.
- Create a factor variable containing the movie type.
The header of our final dataset, called myMovieData, will then be Budget, Short, Year, and Type. So, here's our code:
## One subset per movie type, each labeled with its type
d1 <- data.frame(movies[movies$Action==1, c("budget", "Short", "year")])
d1$Type <- "Action"
d2 <- data.frame(movies[movies$Animation==1, c("budget", "Short", "year")])
d2$Type <- "Animation"
d3 <- data.frame(movies[movies$Comedy==1, c("budget", "Short", "year")])
d3$Type <- "Comedy"
d4 <- data.frame(movies[movies$Drama==1, c("budget", "Short", "year")])
d4$Type <- "Drama"
d5 <- data.frame(movies[movies$Documentary==1, c("budget", "Short", "year")])
d5$Type <- "Documentary"
d6 <- data.frame(movies[movies$Romance==1, c("budget", "Short", "year")])
d6$Type <- "Romance"
## Stack the subsets, rename the columns, and store the type as a factor
myMovieData <- rbind(d1, d2, d3, d4, d5, d6)
names(myMovieData) <- c("Budget", "Short", "Year", "Type")
myMovieData$Type <- factor(myMovieData$Type)
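The same three steps can also be written as a loop over the movie types. What follows is only a sketch of that idea, not the code used in the rest of the chapter: build_movie_data and the toy data frame are names and values made up for illustration, and the toy data merely mimics the 0/1 genre columns of movies.

```r
# Hypothetical helper: stack one subset per genre flag into a single data frame.
build_movie_data <- function(movies,
                             types = c("Action", "Animation", "Comedy",
                                       "Drama", "Documentary", "Romance")) {
  pieces <- lapply(types, function(type) {
    # Rows flagged with this genre (genre columns hold 0 or 1)
    rows <- which(movies[[type]] == 1)
    data.frame(Budget = movies$budget[rows],
               Short  = movies$Short[rows],
               Year   = movies$year[rows],
               Type   = rep(type, length(rows)),
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  # Store the movie type as a factor, keeping the order given in `types`
  out$Type <- factor(out$Type, levels = types)
  out
}

# Made-up data for illustration only (not taken from the movies dataset)
toy <- data.frame(
  budget = c(100, NA, 250, 40),
  Short = c(0L, 1L, 0L, 1L),
  year = c(1999L, 2001L, 1985L, 2003L),
  Action = c(1, 0, 0, 0),
  Animation = c(0, 1, 0, 1),
  Comedy = c(1, 1, 0, 0),
  Drama = c(0, 0, 1, 0),
  Documentary = c(0, 0, 0, 0),
  Romance = c(0, 0, 1, 0)
)

toyMovieData <- build_movie_data(toy)
print(table(toyMovieData$Type))
```

Note that, with either version of the code, a movie flagged with more than one type contributes one row per type, so the rows of the final dataset count movie-type pairs rather than distinct movies.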
Now that our data is ready, let's create our first bar chart. In general, we will follow the same structure as the other plots, just replacing the geom specification:
qplot(Type, data=myMovieData , geom="bar", fill=Type)
This standard bar chart contains one bar for each movie type, and the height of each bar represents the number of movies of that type. Since we have also assigned the fill aesthetic attribute to the same Type variable, each bar is also drawn in a different color. The plot generated is represented in Figure 2.5:
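To make the counting explicit, the following sketch uses a small made-up data frame (toyTypes, which is not part of the movies data) and compares the counts returned by the base R function table() with a bar chart of the same variable. The chart is written with ggplot() and geom_bar(), which should produce the same kind of plot as the qplot() call shown earlier:

```r
library(ggplot2)

# Made-up data for illustration only
toyTypes <- data.frame(
  Type = factor(c("Action", "Comedy", "Comedy", "Drama", "Comedy", "Drama"))
)

# The bar heights are the number of rows per level of Type
typeCounts <- table(toyTypes$Type)
print(typeCounts)

# Counterpart of qplot(Type, data = toyTypes, geom = "bar", fill = Type),
# written with the full ggplot() syntax
p <- ggplot(toyTypes, aes(x = Type, fill = Type)) + geom_bar()
print(p)
```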
In the plot we just created, the bars are colored differently depending on the movie type. However, we can use the fill argument in a more useful way: we can base the color on the value of a second variable, in this way adding more information to the plot. In our simple example, we can split each bar according to the number of short and regular movies it contains. This is done simply by assigning the Short column to the fill argument, as shown in the following code:
qplot(Type, data=myMovieData , geom="bar", fill=factor(Short))
The result is shown in Figure 2.6. As illustrated, each bar is now divided into the counts of short and regular movies, which together add up to the total number of movies for each type.
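The numbers behind such a split bar chart can be checked with a two-way table. This is a minimal sketch on made-up data (toyMovies is hypothetical and only mimics the Type and Short columns):

```r
# Made-up data for illustration only
toyMovies <- data.frame(
  Type  = factor(c("Comedy", "Comedy", "Comedy", "Drama", "Drama")),
  Short = c(0L, 1L, 1L, 0L, 0L)
)

# One row per movie type, one column per value of Short
segmentCounts <- table(toyMovies$Type, toyMovies$Short)
print(segmentCounts)

# Adding the segments of each bar gives back the plain count per type
print(rowSums(segmentCounts))
```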
As you probably noticed in this last example, we assigned the Short variable to the fill argument, but in the assignment, we also converted the variable to a factor, while in the previous example, when we used the Type variable, we did not do so. The reason is that the fill aesthetic attribute, in this case, needs a discrete variable with a defined set of levels, each of which is then assigned a different color. The Type variable of the previous example was already a factor, where each level represented a movie type. On the other hand, the Short variable is actually numeric: 0 for regular movies and 1 for short movies. For this reason, we had to convert it to a factor first, so that qplot could treat it as a discrete variable with two levels. We will also discuss in detail the assignment of discrete and continuous variables in Chapter 4, Advanced Plotting Techniques. You can check the class of the two columns with the following code:
> class(myMovieData$Short)
[1] "integer"
> class(myMovieData$Type)
[1] "factor"
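When converting a 0/1 variable such as Short, it can also help to give the two levels readable labels, which then appear in the legend. The sketch below uses a made-up vector (shortFlag) and labels chosen for this example:

```r
# Made-up values in the style of the Short column
shortFlag <- c(0L, 1L, 1L, 0L)
class(shortFlag)      # an integer vector, which would be treated as numeric

# Convert to a factor; the labels "Regular" and "Short" are our own choice
shortFactor <- factor(shortFlag, levels = c(0, 1),
                      labels = c("Regular", "Short"))
class(shortFactor)
levels(shortFactor)
```

The same factor() call, applied to Short, could be passed directly to the fill argument in place of factor(Short).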
One last thing to mention about bar charts is the position argument of the qplot function. This argument defines the way you would like to display the bars within the chart. The three main options are stack, dodge, and fill. The stack option puts the bars with the same x value on top of each other; the dodge option places the bars with the same x value next to each other; and the fill option places the bars on top of each other but normalizes the total height to 1, so that each segment shows a proportion rather than a count. The following code shows the position adjustment applied to our last example:
qplot(Type, data=myMovieData, geom="bar", fill=factor(Short), position="stack")
qplot(Type, data=myMovieData, geom="bar", fill=factor(Short), position="dodge")
qplot(Type, data=myMovieData, geom="bar", fill=factor(Short), position="fill")
Figure 2.7 shows you the resulting plot for each option:
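Depending on your ggplot2 version, qplot() may no longer accept the position argument; treat this as an assumption to check against your installation. In that case, the three layouts can be sketched with ggplot() and geom_bar(), which takes position directly. The example uses made-up data (toyMovies) and also computes, with prop.table(), the proportions that the fill option displays:

```r
library(ggplot2)

# Made-up data for illustration only
toyMovies <- data.frame(
  Type  = factor(c("Comedy", "Comedy", "Comedy", "Comedy", "Drama", "Drama")),
  Short = c(0L, 1L, 1L, 1L, 0L, 0L)
)

base <- ggplot(toyMovies, aes(x = Type, fill = factor(Short)))

pStack <- base + geom_bar(position = "stack")  # segments on top of each other
pDodge <- base + geom_bar(position = "dodge")  # segments side by side
pFill  <- base + geom_bar(position = "fill")   # stacked, each bar rescaled to 1

# What position = "fill" displays: the share of each Short value within a type
shares <- prop.table(table(toyMovies$Type, toyMovies$Short), margin = 1)
print(shares)
```

Printing pStack, pDodge, or pFill draws the corresponding chart; in the fill version, every bar has the same height and only the relative size of the segments differs.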