You're reading from R Bioinformatics Cookbook Utilize R packages for bioinformatics, genomics, data science, and machine learning

Product type Paperback

Published in Oct 2023

Publisher Packt

ISBN-13 9781837634279

Length 396 pages

Edition 2nd Edition

Languages

Tools

ChatGPT

Concepts

Bioinformatics

Author (1):

Dan MacLean

View More author details

Table of Contents (16) Chapters

Preface

1. Chapter 1: Setting Up Your R Bioinformatics Working Environment

2. Chapter 2: Loading, Tidying, and Cleaning Data in the tidyverse FREE CHAPTER

3. Chapter 3: ggplot2 and Extensions for Publication Quality Plots

4. Chapter 4: Using Quarto to Make Data-Rich Reports, Presentations, and Websites

5. Chapter 5: Easily Performing Statistical Tests Using Linear Models

6. Chapter 6: Performing Quantitative RNA-seq

7. Chapter 7: Finding Genetic Variants with HTS Data

8. Chapter 8: Searching Gene and Protein Sequences for Domains and Motifs

9. Chapter 9: Phylogenetic Analysis and Visualization

10. Chapter 10: Analyzing Gene Annotations

11. Chapter 11: Machine Learning with mlr3

12. Chapter 12: Functional Programming with purrr and base R

13. Chapter 13: Turbo-Charging Development in R with ChatGPT

14. Index

Why subscribe?

15. Other Books You May Enjoy

Computing new data columns from existing ones and applying arbitrary functions using mutate()

Data frames are the core data structure in R for storing and manipulating tabular data. They are similar to a table in a relational database or a spreadsheet, with rows representing observations and columns representing variables. The mutate() function in dplyr is used to add new columns to a data frame by applying a function or calculation to existing columns; it can be used on both data frames and nested datasets. A nested dataset is a data structure that contains multiple levels of information, such as lists or data frames within data frames.

Getting ready

In this recipe, we’ll use a very small data frame that will be created in the recipe and the dplyr, tidyr, and purrr packages.

How to do it…

The functionality offered by the mutate() pattern is exemplified in the following steps:

Add some new columns:

library(dplyr)df <- data.frame(gene_id = c("gene1", "gene2", "gene3"),                 tissue1 = c(5.1, 7.3, 8.2),                 tissue2 = c(4.8, 6.1, 9.5))df <- df |> mutate(log2_tissue1 = log2(tissue1),                    log2_tissue2 = log2(tissue2))

Conditionally operate on columns:

df |> mutate_if(is.numeric, log2)df |> mutate_at(vars(starts_with("tissue")), log2)

Operate across columns:

library(dplyr)df <- data.frame(gene_id = c("gene1", "gene2", "gene3"),                 tissue1 = c(5.1, 7.3, 8.2),                 tissue2 = c(4.8, 6.1, 9.5),                 tissue3 = c(8.5, 12.5, 6.5))df <- df |>   rowwise() |>    mutate(mean = mean(c(tissue1, tissue2, tissue3)),         stddev = sd(c(tissue1, tissue2, tissue3))         )

Operate on nested data:

df <- data.frame(gene_id = c("gene1", "gene2", "gene3"),                 tissue1_value = c(5.1, 7.3, 8.2),                 tissue1_pvalue = c(0.01, 0.05, 0.001),                 tissue2_value = c(4.8, 6.1, 9.5),                 tissue2_pvalue = c(0.03, 0.04, 0.001)                 ) |>       tidyr::nest(-gene_id, .key = "tissue")df <- df |> mutate(tissue = purrr::map(tissue, ~ mutate(.x, value_log2 = log2(.[1]),pvalue_log2 = log2(.[2])) ))

These are the various ways we can work with mutate() on different data frames to different ends.

How it works…

In step 1, we start by creating a data frame containing three genes, with columns representing the values of each gene. Then, we use the [mutate(){custom-style='P - Code'} function in its most basic form to apply the log2 transformation of the values into new columns.

In step 2, we apply the transformation conditionally with mutate_if(), which applies the specified function to all columns that match the specified condition (in this case, is.numeric), and mutate_at() to apply the log2 function to all columns that start with the name tissue.

In step 3, we create a bigger data frame of expression data and then use the rowwise() function in conjunction with mutate() to add two new columns, mean and stddev, which contain the mean and standard deviation of the expression values across all three tissues for each gene. If we just use mutate() on the vector, the function will be applied to the entire column instead of the rows. As we need to calculate the mean and standard deviation of the expression values across all tissues, this would be difficult to achieve just by using mutate() on vectors in the standard column-wise fashion.

Finally, we look at using mutate() in nested data frames. We begin by creating a data frame of genes and expression values and use the nest() function to create a nested data frame. We can then use the combination of mutate() and map() functions from the purrr package, to extract the tissue1_value and tissue2_value columns from the nested part of the data frame.