Tidying a wide format table into a tidy table with tidyr
The tidyr
package in R is a package that provides tools for tidying and reshaping data. It is designed to make it easy to work with data in a consistent and structured format, which is known as a tidy format. Tidy data is a standard way of organizing data that makes it easy to perform data analysis and visualization.
The main principles of tidy data are as follows:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
Data in a tidy format is easier to work with because the structure of the data is consistent, facilitating operations such as filtering, grouping, and reshaping the data. Tidy data is also more compatible with various data visualization and analysis tools, such as ggplot2
, dplyr
, and other tidyverse
packages.
Our aim in this recipe will be to take a wide format data frame where a lot of information is hiding in column names and squeeze and reformat them into a data column of their own and rationalize them in the process.
Getting ready
We’ll need the rbioinfcookbook
and tidyr
packages. We’ll use the finished output from recipe 1, which is saved in the package.
How to do it…
We have to use just one function, but the options are many.
Specify the transformation to the table:
library(rbioinfcookbook)library(dplyr) library(tidyr) long_df <- census_df |> rename("0_to_4" = "under_4", "90_to_120" = "over_90") |> pivot_longer( cols = contains("_to_"), names_to = c("age_from", "age_to"), names_pattern = "(.*)_to_(.*)", names_transform = list("age_from" = as.integer, age.to = as.integer), values_to = "count" )
And that’s it. This short recipe is very dense, though.
How it works…
The tidyr
package has functions that work by allowing the user to specify a particular transformation that will be applied to the data frame to generate a new one. In this single step, we specify a table row-count increasing operation that will find all the columns that contain age information. Next, we split the title of that column into data for two new columns—one for the lower boundary of the age category and one for the upper boundary of the age category. Then, we change the type of those new columns to integer and, lastly, put the actual counts in a new column.
The first function in this pipeline is code from dplyr
, which helps us rename column headings. Our age data column names are largely consistent, except for the lower bound and the upper one, so we rename those columns to match the pattern of the others, simplifying the transform specification.
The pivot_longer()
function specifies the transform in the arguments, with the cols
argument we choose to operate on any columns containing the text to
. The names_pattern
argument takes a regular expression (regex) that captures the bits of text before and after the to
string in the column names and uses them as values for the columns defined in the names_to
argument. The actual counts from the cell are put into a new column called counts
. The transformation is then applied in one step and reduces the data frames column count to eight, increasing the row count to 6,935, and in the process making the data tidy and easier to use in downstream packages.
See also
The recipe uses a regex to describe a pattern in text. If you haven’t seen these before and need a primer, try typing ?"regular expression"
to view the R help on the topic.