Search icon CANCEL
Subscription
0
Cart icon
Cart
Close icon
You have no products in your basket yet
Save more on your purchases!
Savings automatically calculated. No voucher code required
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
R Bioinformatics Cookbook - Second Edition

You're reading from  R Bioinformatics Cookbook - Second Edition

Product type Book
Published in Oct 2023
Publisher Packt
ISBN-13 9781837634279
Pages 396 pages
Edition 2nd Edition
Languages
Author (1):
Dan MacLean Dan MacLean
Profile icon Dan MacLean
Toc

Table of Contents (16) Chapters close

Preface 1. Chapter 1: Setting Up Your R Bioinformatics Working Environment 2. Chapter 2: Loading, Tidying, and Cleaning Data in the tidyverse 3. Chapter 3: ggplot2 and Extensions for Publication Quality Plots 4. Chapter 4: Using Quarto to Make Data-Rich Reports, Presentations, and Websites 5. Chapter 5: Easily Performing Statistical Tests Using Linear Models 6. Chapter 6: Performing Quantitative RNA-seq 7. Chapter 7: Finding Genetic Variants with HTS Data 8. Chapter 8: Searching Gene and Protein Sequences for Domains and Motifs 9. Chapter 9: Phylogenetic Analysis and Visualization 10. Chapter 10: Analyzing Gene Annotations 11. Chapter 11: Machine Learning with mlr3 12. Chapter 12: Functional Programming with purrr and base R 13. Chapter 13: Turbo-Charging Development in R with ChatGPT 14. Index 15. Other Books You May Enjoy

Reformatting and extracting existing data into new columns using stringr

Text manipulation is important in bioinformatics as it allows, among other things, for efficient processing and analysis of DNA and protein sequence annotation data. The R stringr package is a good choice for text manipulation because it provides a simple and consistent interface for common string operations, such as pattern matching and string replacement. stringr is built on top of the powerful stringi manipulation library, making it a fast and efficient tool for working with arbitrary strings. In this recipe, we’ll look at rationalizing data held in messy FAST-All (FASTA)-style sequence headers.

Getting ready

We’ll use the Arabidopsis gene names in the ath_seq_names vector provided by the rbioinfcookbook package and the stringr package.

How to do it…

To reformat gene names using stringr, we can proceed as follows:

  1. Capture the ATxGxxxxx format IDs:
    library(rbioinfcookbook)library(stringr)ids <- str_extract(ath_seq_names, "^AT\\dG.*\\.\\d")
  2. Separate the string into elements and extract the description:
    description <- str_split(ath_seq_names, "\\|", simplify = TRUE)[,3] |>   str_trim()
  3. Separate the string into elements and extract the gene information:
    info <- str_split(ath_seq_names, "\\|", simplify = TRUE)[,4] |>   str_trim()
  4. Match and recall the chromosome and coordinates:
    chr <- str_match(info, "chr(\\d):(\\d+)-(\\d+)")
  5. Find the number of characters the strand information begins at and use that as an index:
    strand_pos <- str_locate(info, "[FORWARD|REVERSE]")strand <- str_sub(info, start=strand_pos, end=strand_pos+1)
  6. Extract the length information:
    lengths <- str_match(info, "LENGTH=(\\d+)$")[,2]
  7. Combine all captured information into a data frame:
    results <- data.frame(  ids = ids,  description = description,  chromosome = as.integer(chr[,2]),  start = as.integer(chr[,3]),  end = as.integer(chr[,4]),  strand = strand,  length = as.integer(lengths))

And that gives us a very nice, reformatted data frame.

How it works…

The R code uses the stringr library to extract, split, and manipulate information from a vector of sequence names (ath_seq_names) and assigns the resulting information to different variables. The rbioinfcookbook library provides the initial ath_seq_names vector.

The first step of the recipe uses the str_extract() function from stringr to extract a specific pattern of characters. The "^AT\dG.*.\d" regex matches any string that starts with "AT", followed by one digit, then "G", then any number of characters, then a dot, and finally one digit. stringr operations are vectorized so that all entries in them are processed.

Steps 2 and 3 are similar and use the str_split() function to split the seq_names vector by the "|" character; the simplify option returns a matrix of results with a column for each substring. The str_trim() function removes troublesome leading and trailing whitespace from the resulting substring. The third and fourth columns of the resulting matrix are saved.

The following line of code uses the str_match() function to extract specific substrings from the info variable that match the "chr(\d):(\d+)-(\d+)" regex. This regex matches any string that starts with "chr", followed by one digit, then ":", then one or more digits, then "-", and finally one or more digits. The '()' bracket symbols mark the piece of text to save; each saved piece goes into a column in the matrix.

The next line of code uses the str_locate() function to find the position of the first occurrence of either FORWARD or REVERSE in the info variable. The resulting position is then used to extract the character at that position using str_sub(). The last line of code uses the str_match() function to extract the substring that starts with "LENGTH=" and ends with one or more digits from the info variable.

Finally, the code creates a data frame result by combining the extracted and subsetted variables, assigning appropriate types for each column.

You have been reading a chapter from
R Bioinformatics Cookbook - Second Edition
Published in: Oct 2023 Publisher: Packt ISBN-13: 9781837634279
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}