You're reading from R Bioinformatics Cookbook Utilize R packages for bioinformatics, genomics, data science, and machine learning

Product type Paperback

Published in Oct 2023

Publisher Packt

ISBN-13 9781837634279

Length 396 pages

Edition 2nd Edition

Author (1):

Dan MacLean

Tidying a wide format table into a tidy table with tidyr

The tidyr package provides tools for tidying and reshaping data in R. It is designed to make it easy to work with data in a consistent, structured layout known as tidy format. Tidy data is a standard way of organizing data that makes analysis and visualization easier.

The main principles of tidy data are as follows:

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

Data in a tidy format is easier to work with because the structure of the data is consistent, facilitating operations such as filtering, grouping, and reshaping the data. Tidy data is also more compatible with various data visualization and analysis tools, such as ggplot2, dplyr, and other tidyverse packages.
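To make the contrast concrete, here is a minimal sketch of the same information held in wide and in tidy form. The data frames, area labels, and counts are hypothetical toy values made up for illustration; they are not taken from the recipe's census data.

```r
# Hypothetical toy data in wide format: the age bands live in the column names.
wide <- data.frame(
  area     = c("A", "B"),
  `0_to_4` = c(10L, 20L),
  `5_to_9` = c(12L, 18L),
  check.names = FALSE  # keep the column names exactly as written
)

# The same information in tidy format: one row per area and age band,
# with each variable (area, age bounds, count) in its own column.
tidy <- data.frame(
  area     = c("A", "A", "B", "B"),
  age_from = c(0L, 5L, 0L, 5L),
  age_to   = c(4L, 9L, 4L, 9L),
  count    = c(10L, 12L, 20L, 18L)
)

# Because age is now data rather than a column name, selecting an age
# range is an ordinary row filter.
under_fives <- tidy[tidy$age_to < 5, ]
total_under_fives <- sum(under_fives$count)
```

In the wide table, the same question would need the column names to be parsed first, which is exactly the work the recipe hands over to tidyr.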

Our aim in this recipe is to take a wide format data frame in which a lot of information is hidden in the column names, move that information into data columns of its own, and rationalize it in the process.

Getting ready

We’ll need the rbioinfcookbook, dplyr, and tidyr packages. We’ll use the finished output from recipe 1, which is saved in the rbioinfcookbook package as census_df.

How to do it…

The reshaping itself needs just one function, pivot_longer(), but it has many options.

Specify the transformation to the table:

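library(rbioinfcookbook)
library(dplyr)
library(tidyr)

long_df <- census_df |> 
  # Make the two irregular age columns follow the "<from>_to_<to>" pattern
  rename("0_to_4" = "under_4", "90_to_120" = "over_90") |> 
  pivot_longer(
    cols = contains("_to_"),                 # every age-band column
    names_to = c("age_from", "age_to"),      # new columns built from the names
    names_pattern = "(.*)_to_(.*)",          # one capture group per new column
    names_transform = list(age_from = as.integer,
                           age_to = as.integer),
    values_to = "count"                      # column that receives the cell values
  )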

And that’s it. This short recipe is very dense, though.

How it works…

The tidyr package has functions that let the user specify a transformation that is applied to a data frame to generate a new one. In this single step, we specify an operation that increases the row count of the table. It finds all the columns that contain age information and splits the name of each one into data for two new columns: one for the lower boundary of the age category and one for the upper boundary. It then changes the type of those new columns to integer and, lastly, puts the actual counts in a new column.
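The name-splitting step can be sketched on its own with base R. This is only an illustration of what the regular expression captures, not how tidyr implements it internally; the col_names vector is a hypothetical sample of column names that follow the recipe's pattern.

```r
# Hypothetical sample of age-band column names in the "<from>_to_<to>" pattern
col_names <- c("0_to_4", "5_to_9", "90_to_120")

# regexec() returns the full match plus one entry per capture group;
# regmatches() extracts the matched text as a list of character vectors.
parts <- regmatches(col_names, regexec("(.*)_to_(.*)", col_names))

# Element 1 is the whole match, 2 is the first group, 3 is the second group.
age_from <- as.integer(vapply(parts, `[`, character(1), 2))
age_to   <- as.integer(vapply(parts, `[`, character(1), 3))

data.frame(col_names, age_from, age_to)
```

Each column name yields one lower bound and one upper bound, converted from text to integer in the same way the recipe's names_transform argument requests.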

The first function in this pipeline, rename(), comes from dplyr and renames column headings. Our age column names are largely consistent, except for the lowest and highest age categories, so we rename those two columns to match the pattern of the others, which simplifies the transform specification.
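As a small sketch of the renaming step, the following applies the same rename() call to a hypothetical one-row stand-in for the census table (the toy data frame and its values are assumptions for illustration). Note that rename() takes its arguments in new = old order.

```r
library(dplyr)

# Hypothetical stand-in with the two irregular column names from the recipe
toy <- data.frame(
  area     = "A",
  under_4  = 10L,
  `5_to_9` = 12L,
  over_90  = 1L,
  check.names = FALSE  # keep "5_to_9" as a literal column name
)

# new name on the left, existing name on the right
renamed <- toy |>
  rename("0_to_4" = "under_4", "90_to_120" = "over_90")

names(renamed)
```

After renaming, every age column contains the text _to_, so a single selection helper and a single regular expression can handle all of them.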

The pivot_longer() function specifies the transform through its arguments. With the cols argument, we choose to operate on any column whose name contains the text _to_. The names_pattern argument takes a regular expression (regex) whose two capture groups match the text before and after the _to_ string in each column name; the captured text becomes the values of the columns defined in the names_to argument. The names_transform argument converts those values to integers. The actual counts from the cells are put into a new column called count. The transformation is then applied in one step: it reduces the data frame's column count to eight and increases the row count to 6,935, making the data tidy and easier to use in downstream packages.
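Since census_df ships with the rbioinfcookbook package, the following sketch runs the same pipeline on a small hypothetical data frame so that the effect on rows and columns can be checked by hand. The toy table and its values are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy table: 2 areas, 3 age-band columns (two with irregular names)
toy <- data.frame(
  area     = c("A", "B"),
  under_4  = c(10L, 20L),
  `5_to_9` = c(12L, 18L),
  over_90  = c(1L, 2L),
  check.names = FALSE
)

toy_long <- toy |>
  rename("0_to_4" = "under_4", "90_to_120" = "over_90") |>
  pivot_longer(
    cols = contains("_to_"),
    names_to = c("age_from", "age_to"),
    names_pattern = "(.*)_to_(.*)",
    names_transform = list(age_from = as.integer, age_to = as.integer),
    values_to = "count"
  )

# 2 areas x 3 age bands = 6 rows; columns: area, age_from, age_to, count
dim(toy_long)
```

The three age columns collapse into age_from, age_to, and count, so the table gets narrower while each original row expands into one row per age band.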