You're reading from R Bioinformatics Cookbook - Second Edition

Product type Book

Published in Oct 2023

Publisher Packt

ISBN-13 9781837634279

Pages 396 pages

Edition 2nd Edition

Concepts

Bioinformatics

Author (1):

Dan MacLean

Reformatting and extracting existing data into new columns using stringr

Text manipulation is important in bioinformatics as it allows, among other things, for efficient processing and analysis of DNA and protein sequence annotation data. The R stringr package is a good choice for text manipulation because it provides a simple and consistent interface for common string operations, such as pattern matching and string replacement. stringr is built on top of the powerful stringi manipulation library, making it a fast and efficient tool for working with arbitrary strings. In this recipe, we’ll look at rationalizing data held in messy FAST-All (FASTA)-style sequence headers.

Getting ready

We’ll need the stringr package and the Arabidopsis gene names in the ath_seq_names vector, which is provided by the rbioinfcookbook package.

How to do it…

To reformat gene names using stringr, we can proceed as follows:

Capture the ATxGxxxxx format IDs:

library(rbioinfcookbook)
library(stringr)
ids <- str_extract(ath_seq_names, "^AT\\dG.*\\.\\d")

Separate the string into elements and extract the description:

description <- str_split(ath_seq_names, "\\|", simplify = TRUE)[, 3] |>
  str_trim()

Separate the string into elements and extract the gene information:

info <- str_split(ath_seq_names, "\\|", simplify = TRUE)[, 4] |>
  str_trim()

Match and recall the chromosome and coordinates:

chr <- str_match(info, "chr(\\d):(\\d+)-(\\d+)")

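Find the character position at which the strand information begins and use it as an index:

# Alternation (no square brackets) matches either whole word
strand_pos <- str_locate(info, "FORWARD|REVERSE")
# Keep the single character at the start position: "F" or "R"
strand <- str_sub(info, start = strand_pos[, "start"], end = strand_pos[, "start"])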

Extract the length information:

lengths <- str_match(info, "LENGTH=(\\d+)$")[,2]

Combine all captured information into a data frame:

results <- data.frame(
  ids = ids,
  description = description,
  chromosome = as.integer(chr[, 2]),
  start = as.integer(chr[, 3]),
  end = as.integer(chr[, 4]),
  strand = strand,
  length = as.integer(lengths)
)

That gives us a tidy data frame with one row per sequence header and one column per extracted field.

How it works…

The R code uses the stringr library to extract, split, and manipulate information from a vector of sequence names (ath_seq_names) and assigns the resulting information to different variables. The rbioinfcookbook library provides the initial ath_seq_names vector.

The first step of the recipe uses the str_extract() function from stringr to pull out the gene ID. The "^AT\\dG.*\\.\\d" regex matches text at the start of each string: "AT", followed by one digit, then "G", then any number of characters, then a literal dot, and finally one digit. The backslashes are doubled because the pattern is written as an R string. stringr functions are vectorized, so every element of ath_seq_names is processed in a single call.
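As a rough illustration of the extraction step, the sketch below runs the same pattern on a few made-up headers. The headers are hypothetical stand-ins modelled on the format used in this recipe, not entries from ath_seq_names, and the tighter pattern at the end is a suggested alternative rather than part of the recipe.

```r
library(stringr)

# Hypothetical headers for illustration only (not taken from ath_seq_names)
headers <- c(
  "AT1G01010.1 | Symbols: NAC001 | NAC domain protein | chr1:3631-5899 FORWARD LENGTH=1290",
  "AT2G01021.2 | Symbols:  | hypothetical protein | chr2:6571-6672 REVERSE LENGTH=102",
  "not a gene header"
)

# One call processes every element; elements without a match become NA
ids <- str_extract(headers, "^AT\\dG.*\\.\\d")
ids

# ".*" is greedy, so a later "dot + digit" in the header extends the match
tricky <- "AT3G00001.1 | Symbols:  | protein of 2.5 kDa | chr3:100-400 FORWARD LENGTH=300"
greedy <- str_extract(tricky, "^AT\\dG.*\\.\\d")

# Assumed alternative: allow only digits around the dot
strict <- str_extract(tricky, "^AT\\dG\\d+\\.\\d+")
greedy
strict
```

If descriptions in your own data can contain decimal numbers, the stricter pattern may be the safer choice; whether that applies to ath_seq_names is something to check rather than assume.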

Steps 2 and 3 are similar: each uses the str_split() function to split the ath_seq_names vector on the "|" character, which is escaped as "\\|" because "|" is a regex metacharacter. The simplify = TRUE option returns a character matrix with one row per input string and one column per substring. Step 2 keeps the third column (the description) and step 3 keeps the fourth (the gene information); the str_trim() function then removes troublesome leading and trailing whitespace from each.
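To make the shape of the split result concrete, here is a minimal sketch on two made-up headers (hypothetical values, not from ath_seq_names). It also shows fixed("|") as an assumed alternative to escaping the pipe.

```r
library(stringr)

# Hypothetical headers for illustration only
headers <- c(
  "AT1G01010.1 | Symbols: NAC001 | NAC domain protein | chr1:3631-5899 FORWARD LENGTH=1290",
  "AT2G01021.2 | Symbols:  | hypothetical protein | chr2:6571-6672 REVERSE LENGTH=102"
)

# simplify = TRUE gives a character matrix: one row per header, one column per field
parts <- str_split(headers, "\\|", simplify = TRUE)
dim(parts)

# Fields keep the spaces that surrounded each "|" until they are trimmed
raw_description <- parts[, 3]
description <- str_trim(raw_description)
raw_description
description

# fixed() treats "|" as a literal character, so no escaping is needed
parts_fixed <- str_split(headers, fixed("|"), simplify = TRUE)
```

Both calls should return the same matrix; which spelling to use is largely a matter of taste.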

Step 4 uses the str_match() function to extract substrings of the info variable that match the "chr(\\d):(\\d+)-(\\d+)" regex. The pattern matches "chr", followed by one digit, then ":", then one or more digits, then "-", and finally one or more digits. Each pair of parentheses marks a capture group, a piece of text to save. str_match() returns a matrix whose first column holds the whole match and whose remaining columns hold the capture groups, here the chromosome, start, and end.
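The column layout of the str_match() result can be seen in a small sketch like the one below. The info strings are invented for illustration, and the third one uses a made-up non-numeric chromosome label (chrC) to show what happens when the pattern does not match.

```r
library(stringr)

# Hypothetical info strings for illustration only
info <- c(
  "chr1:3631-5899 FORWARD LENGTH=1290",
  "chr2:6571-6672 REVERSE LENGTH=102",
  "chrC:1-100 FORWARD LENGTH=100"
)

chr <- str_match(info, "chr(\\d):(\\d+)-(\\d+)")

# Column 1 is the whole match; columns 2 to 4 are the capture groups
chr[1, ]

# A non-matching string gives a row of NA, which stays NA after conversion
starts <- as.integer(chr[, 3])
starts
```

Because "\\d" only matches a digit, any chromosome label that is not a single digit would come back as NA with this pattern, so it is worth counting the NA values before relying on the result.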

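Step 5 uses the str_locate() function to find the position at which the strand label, FORWARD or REVERSE, begins in each element of the info variable. str_locate() returns a matrix of start and end positions, and the start position is then used with str_sub() to extract the single character at that position, "F" or "R". Note that alternation is written as FORWARD|REVERSE without square brackets; inside square brackets, the same text is a character class that matches any one of the listed characters. Step 6 uses the str_match() function with the "LENGTH=(\\d+)$" regex, which matches "LENGTH=" followed by one or more digits at the end of the string; taking the second column of the result keeps only the captured digits.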
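The difference between a character class and alternation is easy to miss, so here is a hedged sketch on two invented strings. The second string deliberately places LENGTH before the strand, which is not the layout used in this recipe, to show where the two patterns diverge.

```r
library(stringr)

# Hypothetical info strings; the second is reordered on purpose
info <- c(
  "chr1:3631-5899 FORWARD LENGTH=1290",
  "LENGTH=102 chr2:6571-6672 REVERSE"
)

# Character class: matches the first single character that is any of F, O, R, W, A, D, |, E, V, S
class_pos <- str_locate(info, "[FORWARD|REVERSE]")

# Alternation: matches the first occurrence of either whole word
alt_pos <- str_locate(info, "FORWARD|REVERSE")

class_pos[, "start"]
alt_pos[, "start"]

# One-letter strand code taken at the start position
strand <- str_sub(info, alt_pos[, "start"], alt_pos[, "start"])

# If the full word is wanted, str_extract() gets it in one call
strand_full <- str_extract(info, "FORWARD|REVERSE")
```

In the first string both patterns land on the "F" of FORWARD, but in the second the character class stops at the "E" of LENGTH, while the alternation still finds REVERSE.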

Finally, the code combines the extracted vectors into the results data frame, using as.integer() to convert the chromosome, coordinate, and length columns from character to integer.
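For readers without the rbioinfcookbook package to hand, the following compact variation shows the assembly step end to end on two made-up headers. It is a sketch under stated assumptions rather than the recipe itself: the headers are hypothetical, the header is split only once, and the strand is kept as the full word using str_extract().

```r
library(stringr)

# Hypothetical stand-ins for two elements of ath_seq_names
headers <- c(
  "AT1G01010.1 | Symbols: NAC001 | NAC domain protein | chr1:3631-5899 FORWARD LENGTH=1290",
  "AT2G01021.2 | Symbols:  | hypothetical protein | chr2:6571-6672 REVERSE LENGTH=102"
)

# Split once and reuse the matrix for every field
fields <- str_split(headers, "\\|", simplify = TRUE)
info <- str_trim(fields[, 4])
chr <- str_match(info, "chr(\\d):(\\d+)-(\\d+)")

results <- data.frame(
  ids = str_trim(fields[, 1]),
  description = str_trim(fields[, 3]),
  chromosome = as.integer(chr[, 2]),   # capture groups are character, so convert
  start = as.integer(chr[, 3]),
  end = as.integer(chr[, 4]),
  strand = str_extract(info, "FORWARD|REVERSE"),
  length = as.integer(str_match(info, "LENGTH=(\\d+)$")[, 2])
)

# Inspect the column types
str(results)
```

Calling str() on the result is a quick way to confirm that the numeric columns really are integers and not character strings.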