You're reading from Web Application Development with R Using Shiny Build stunning graphics and interactive data visualizations to deliver cutting-edge analytics

Product type Paperback

Published in Sep 2018

Publisher

ISBN-13 9781788993128

Length 238 pages

Edition 3rd Edition

Languages

Tools

Shiny

Concepts

Application Development

Authors (2):

Chris Beeley

Shitalkumar R. Sukhdeve

View More author details

Introduction to the tidyverse

The tidyverse is, according to its homepage, "an opinionated collection of R packages designed for data science" (see https://www.tidyverse.org/). All of the packages are installed using install.packages("tidyverse"), and calling library(tidyverse) loads a subset of these packages, those considered to have the most value in day-to-day data science. Calling library(tidyverse) loads the following packages:

ggplot2: For plotting
dplyr: For data wrangling
tidyr: For tidying (and untidying!) data
readr: Better functions for reading comma- and tab-delimited data and other types of flat files
purrr: For iterating
tibble: Better dataframes
stringr: For dealing with strings
forcats: Better handling of factors, an R property used to describe the categories of a variable, described briefly earlier in the chapter

Installing the tidyverse also installs the following packages, which then need to be loaded separately with their own library() instruction: readxl, haven, jsonlite, xml2, httr, rvest, DBI, lubridate, hms, blob, rlang, magrittr, glue, and broom. For more details on these packages, consult the documentation at tidyverse.org/.

The key word in this description is opinionated. The tidyverse is a set of R packages that work with tidy data, either producing it, or consuming it, or both. Tidy data was described by Hadley Wickham in the Journal of Statistical Software, Vol 59, Issue 10 (https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf). Tidy data obeys three principles:

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

For many R users, their first introduction to tidy data will have been ggplot2, which consumes tidy data and requires other types of data to be munged into this form. In order to explain what this means, we will look at a simple example. For the purposes of this discussion, we will ignore the last principle, which is more about organizing groups of datasets rather than individual datasets.

Let's have a look at a simple example of a messy dataset. In the real world, you will find datasets that are a lot messier than this, but this will serve to illustrate the principles we are using here. The following are the first three rows of the medal table for the Pyeongchang Winter Olympics, which took place in 2018, as an R dataframe:

medals = data.frame(country = c("Norway", "Germany", "Canada"),
 gold = c(14, 14, 11),
 silver = c(14, 10, 8),
 bronze = c(11, 7, 10)
 )

If we print it at the console, it looks like the following screenshot:

This is perhaps the most common sort of messy data you will come across: great as a summary, certainly intelligible to people watching the Winter Olympics, but not tidy. There are medals in three different columns. A tidy dataset would contain only one column for the medal tallies. Let's tidy it up using the tidyr package, which is loaded with library(tidyverse), or you can load it separately with library(tidyr). We can tidy the data very simply using the gather() function. The gather() function takes a dataframe as an argument, along with key and value column names (which you can set to what you like), and the column names that you wish to be gathered (all other columns will be duplicated as appropriate). In this case, we want to gather everything except the country, so we can use -variableName to indicate that we wish to gather everything except variableName. The final code looks like the following:

library(tidyr)
gather(medals, key = Type, value = Medals, -country)

This has a nice tidy output, as shown in the following screenshot:

Ceci n'est pas une pipe

Now that we've covered tidy data, there is one more concept that is very common in the tidyverse that we should discuss. This is the pipe (%>%) from the magrittr package. This is similar to the Unix pipe, and it takes the left-hand side of the pipe and applies the right-hand side function to it. Take the following code:

mpg %>% summary()

The preceding code is equivalent to the following code:

summary(mpg)

As another example, look at the following code:

gapminder %>% filter(year > 1960)

The preceding code is equivalent to the following code:

filter(gapminder, year > 1960)

Piping greatly enhances the readability of code that requires several steps to execute. Take the following code:

x %>% f %>% g %>% h

The preceding code is equivalent to the following code:

h(g(f(x)))

To demonstrate with a real example, take the following code:

groupedData = gapminder %>%
   filter(year > 1960) %>%
   group_by(continent, year) %>%
   summarise(meanLife = mean(lifeExp))

The preceding code is equivalent to the following code:

summarise(
  group_by(
    filter(gapminder, year > 1960),
  continent, year),
meanLife = mean(lifeExp))

Hopefully, it should be obvious which is the easier to read of the two.

Gapminder

Now we've looked at tidying data, let's have a quick look at using dplyr and ggplot to filter, process, and plot some data. In this section, and throughout this book, we're going to be using the Gapminder data that was made famous by Hans Rosling and the Gapminder foundation. An excerpt of this data is available from the gapminder package, as assembled by Jenny Bryan, and it can be installed and loaded very simply using install.packages("gapminder"); library(gapminder). As the package description indicates, it includes, for each of the 142 countries that are included, the values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.

In order to prepare the data for plotting, we will make use of dplyr, as shown in the following code:

groupedData = gapminder %>%
 filter(year > 1960) %>%
 group_by(continent, year) %>%
 summarise(meanLife = mean(lifeExp))

This single block of code, all executed in one line, produces a dataframe suitable for plotting, and uses chaining to enhance the simplicity of the code. Three separate data operations, filter(), group_by(), and summarise(), are all used, with the results from each being sent to the next instruction using the %>% operator. The three instructions carry out the following tasks:

filter(): This is similar to subset(). This operation only keeps rows that meet certain requirements—in this case, years beyond 1960.
group_by(): This allows operations to be carried out on subsets of data points—in this case, each continent for each of the years within the dataset.
summarise(): This carries out summary functions, such as sum and mean, on several data points—in this case the mean life expectancy within each continent and available year.

So, to summarize, the preceding code filters the data to select only years beyond 1960, groups it by the continent and year, and finds the mean life expectancy within that continent or year. Printing the output from the preceding code yields the following:

As you can see, the output is a tibble, which has a nice print method that only prints the first several rows. Tibbles are very similar to dataframes, and are often produced by default instead of dataframes within the tidyverse. There are some nice differences, but they are fairly interchangeable with dataframes for our purposes, so we will not get sidetracked by the differences here.

Now we have mentioned tibbles, you can see that the dataframe is a nice summary of the mean life expectancy by year and continent.

A simple Shiny-enabled line plot

We have already seen how easy it is to draw line plots in ggplot2. Let's add some Shiny magic to a line plot now. This can be achieved very easily indeed in RStudio by just navigating to File | New File | R Markdown | New Shiny document and installing the dependencies when prompted. Once a title has been added, this will create a new R Markdown document with interactive Shiny elements. R Markdown is an extension of Markdown (see daringfireball.net/projects/markdown/), which is itself a markup language, such as HTML or LaTeX, which is designed to be easy to use and read. R Markdown allows R code chunks to be run within a Markdown document, which renders the contents dynamic. There is more information about Markdown and R Markdown in Chapter 2, Shiny First Steps. This section gives a very rapid introduction to the type of results possible using Shiny-enabled R Markdown documents.

For more details on how to run interactive documents outside RStudio, refer to goo.gl/Ngubdo. By default, a new document will have placeholder code in it which you can run to demonstrate the functionality. We will add the following:

---
title: "Gapminder"
author: "Chris Beeley"
output: html_document
runtime: shiny
---

```{r, echo = FALSE, message = FALSE}

library(tidyverse)
library(gapminder)

inputPanel( 
  checkboxInput("linear", label = "Add trend line?", value = FALSE) 
) 

# draw the plot 
renderPlot({ 
  
  thePlot = gapminder %>%
    filter(year > 1960) %>%
    group_by(continent, year) %>%
    summarise(meanLife = mean(lifeExp)) %>%
    ggplot(aes(x = year, y = meanLife, 
      group = continent, colour = continent)) +
    geom_line()
  
  if(input$linear){ 
    thePlot = thePlot + geom_smooth(method = "lm") 
  } 
  
  print(thePlot) 
})

```

The first part between the --- is the YAML, which performs the setup of the document. In the case of producing this document within RStudio, this will already be populated for you.

R chunks are marked as shown, with ```{r} to begin and ``` to close. The echo and message arguments are optional—we use them here to suppress output of the actual code and any messages from R in the final document.

We'll go into more detail about how Shiny inputs and outputs are set up later on in the book. For now, just know that the input is set up with a call to checkboxInput(), which, as the name suggests, creates a checkbox. It's given a name ("linear") and a label to display to the user ("Add trend line?"). The output, being a plot, is wrapped in renderPlot(). You can see the checkbox value being accessed on the if(input$linear){...} line. Shiny inputs are always accessed with input$ and then their name, so, in this case, input$linear. When the box is clicked—that is, when it equals TRUE—we can see a trend line being added with geom_smooth(). We'll go into more detail about how all of this code works later in the book; this is a first look so that you can start to see how different tasks are carried out using R and Shiny.

You'll have an interactive graphic once you run the document (click on Run document in RStudio or use the run() command from the rmarkdown package), as shown in the following screenshot: