Reformatting and extracting existing data into new columns using stringr
Text manipulation is important in bioinformatics as it allows, among other things, for efficient processing and analysis of DNA and protein sequence annotation data. The R stringr package is a good choice for text manipulation because it provides a simple and consistent interface for common string operations, such as pattern matching and string replacement. stringr is built on top of the powerful stringi manipulation library, making it a fast and efficient tool for working with arbitrary strings. In this recipe, we’ll look at rationalizing data held in messy FAST-All (FASTA)-style sequence headers.
Getting ready
We’ll need the stringr package and the Arabidopsis gene names in the ath_seq_names vector provided by the rbioinfcookbook package.
How to do it…
To reformat gene names using stringr, we can proceed as follows:
- Capture the ATxGxxxxx format IDs:
library(rbioinfcookbook)
library(stringr)
ids <- str_extract(ath_seq_names, "^AT\\dG.*\\.\\d")
- Separate the string into elements and extract the description:
description <- str_split(ath_seq_names, "\\|", simplify = TRUE)[,3] |> str_trim()
- Separate the string into elements and extract the gene information:
info <- str_split(ath_seq_names, "\\|", simplify = TRUE)[,4] |> str_trim()
- Match and recall the chromosome and coordinates:
chr <- str_match(info, "chr(\\d):(\\d+)-(\\d+)")
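- Find the character position at which the strand information begins and use that as an index:
# "FORWARD|REVERSE" is an alternation; inside square brackets it would be a character class
strand_pos <- str_locate(info, "FORWARD|REVERSE")[, "start"]
# Keep the single character at that position ("F" or "R")
strand <- str_sub(info, start = strand_pos, end = strand_pos)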
- Extract the length information:
lengths <- str_match(info, "LENGTH=(\\d+)$")[,2]
- Combine all captured information into a data frame:
results <- data.frame(
  ids = ids,
  description = description,
  chromosome = as.integer(chr[,2]),
  start = as.integer(chr[,3]),
  end = as.integer(chr[,4]),
  strand = strand,
  length = as.integer(lengths)
)
And that gives us a very nice, reformatted data frame.
How it works…
The R code uses the stringr package to extract, split, and manipulate information from a vector of sequence names (ath_seq_names) and assigns the resulting information to different variables. The rbioinfcookbook package provides the initial ath_seq_names vector.
The first step of the recipe uses the str_extract() function from stringr to extract a specific pattern of characters. The "^AT\\dG.*\\.\\d" regex matches text that starts with "AT", followed by one digit, then "G", then any number of characters, then a literal dot, and finally one digit (the backslashes are doubled because the pattern is written as an R string). stringr operations are vectorized, so every element of the input vector is processed in a single call.
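As a minimal sketch of this behavior, the example below runs the same pattern over a few made-up headers written in the style of the recipe's data (they are illustrative assumptions, not entries taken from ath_seq_names). It shows that non-matching elements come back as NA, and that the greedy .* can run past the ID when a later field happens to contain a dot followed by a digit. The stricter pattern at the end is an assumed alternative, not the recipe's own.

```r
library(stringr)

# Hypothetical headers; the third one has no AT-style ID at all
headers <- c(
  "AT1G01010.1 | Symbols: NAC001 | NAC domain containing protein 1 | chr1:3631-5899 FORWARD LENGTH=1688",
  "AT1G01020.1 | Symbols: ARV1 | Arv1-like protein | chr1:5928-8737 REVERSE LENGTH=1623",
  "unnamed sequence"
)

# One call processes every element; non-matches become NA
ids <- str_extract(headers, "^AT\\dG.*\\.\\d")
ids
# "AT1G01010.1" "AT1G01020.1" NA

# A made-up header whose description contains ".5"
tricky <- "AT2G00001.1 | Symbols: none | version 2.5 protein | chr2:10-20 FORWARD LENGTH=11"

# The greedy .* extends the match to the last dot-plus-digit in the string
greedy <- str_extract(tricky, "^AT\\dG.*\\.\\d")
greedy
# "AT2G00001.1 | Symbols: none | version 2.5"

# Assumed stricter alternative: digits only on either side of the dot
strict <- str_extract(tricky, "^AT\\dG\\d+\\.\\d+")
strict
# "AT2G00001.1"
```

Whether the stricter pattern is needed depends on the data; it only matters if a description field can contain a dot followed by a digit.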
Steps 2 and 3 are similar and use the str_split() function to split the ath_seq_names vector at the "|" character; the simplify option returns a matrix of results with one row per input string and one column per substring. The str_trim() function removes troublesome leading and trailing whitespace from the resulting substring. The third and fourth columns of the resulting matrix are saved.
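The shape of the split result is easier to see on a small input. The sketch below uses two made-up headers (assumed to have the same four-field layout as the recipe's data) and shows why str_trim() is needed after the split.

```r
library(stringr)

# Hypothetical headers with four "|"-separated fields each
headers <- c(
  "AT1G01010.1 | Symbols: NAC001 | NAC domain containing protein 1 | chr1:3631-5899 FORWARD LENGTH=1688",
  "AT1G01020.1 | Symbols: ARV1 | Arv1-like protein | chr1:5928-8737 REVERSE LENGTH=1623"
)

# "|" is a regex metacharacter, so it is escaped as "\\|"
parts <- str_split(headers, "\\|", simplify = TRUE)
dim(parts)
# 2 4  (one row per header, one column per field)

# The raw field still carries the spaces that surrounded each "|"
raw_description <- parts[, 3]
raw_description[1]
# " NAC domain containing protein 1 "

# str_trim() strips leading and trailing whitespace only
description <- str_trim(raw_description)
description
# "NAC domain containing protein 1" "Arv1-like protein"
```

With simplify = TRUE, strings that split into fewer pieces than the longest one are padded with empty strings, so indexing a column by number assumes every header has the same field layout.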
The following line of code uses the str_match() function to extract specific substrings from the info variable that match the "chr(\\d):(\\d+)-(\\d+)" regex. This regex matches the text "chr", followed by one digit, then ":", then one or more digits, then "-", and finally one or more digits. The '()' bracket symbols mark the pieces of text to save. str_match() returns a matrix whose first column holds the complete match and whose remaining columns hold the saved pieces in order, which is why the recipe later reads columns 2, 3, and 4.
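The column layout of the str_match() result can be sketched as follows. The info strings are made up for illustration, and the third one uses a non-numeric chromosome label (an assumption about what such input might look like) to show what happens when the pattern does not match.

```r
library(stringr)

# Hypothetical info fields; "chrC" has no digit after "chr"
info <- c(
  "chr1:3631-5899 FORWARD LENGTH=1688",
  "chr1:5928-8737 REVERSE LENGTH=1623",
  "chrC:100-200 FORWARD LENGTH=101"
)

chr <- str_match(info, "chr(\\d):(\\d+)-(\\d+)")

# Column 1 is the full match; columns 2 to 4 are the capture groups
chr[1, ]
# "chr1:3631-5899" "1" "3631" "5899"

# A non-matching string gives a row of NA values
chr[3, ]
# NA NA NA NA

# Captures are character strings until converted
starts <- as.integer(chr[, 3])
starts
# 3631 5928 NA
```

Because the pattern allows only a single digit after "chr", any label that is not a one-digit number produces NA in every column for that row.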
The next line of code uses the str_locate() function to find the position at which the first occurrence of either FORWARD or REVERSE begins in the info variable. The resulting position is then used to extract the character at that position using str_sub(), giving a one-letter strand code (F or R). The step after that uses the str_match() function with the "LENGTH=(\\d+)$" regex, which matches "LENGTH=" followed by one or more digits at the end of the info variable; taking the second column of the result keeps only the captured digits.
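A minimal sketch of the locate-then-subset idea, using made-up info strings: it assumes the alternation pattern "FORWARD|REVERSE" (written inside square brackets, the same letters would instead form a character class that matches any single one of them). It also shows the matrix that str_locate() returns and the end-anchored LENGTH capture.

```r
library(stringr)

# Hypothetical info fields in the recipe's layout
info <- c(
  "chr1:3631-5899 FORWARD LENGTH=1688",
  "chr1:5928-8737 REVERSE LENGTH=1623"
)

# str_locate() returns an integer matrix with "start" and "end" columns
pos <- str_locate(info, "FORWARD|REVERSE")
pos[, "start"]
# 16 16
pos[, "end"]
# 22 22

# One-letter strand code: the character at the start position
strand <- str_sub(info, start = pos[, "start"], end = pos[, "start"])
strand
# "F" "R"

# Passing the whole two-column matrix as start extracts the full match
strand_word <- str_sub(info, pos)
strand_word
# "FORWARD" "REVERSE"

# "$" anchors the digits to the end of the string; column 2 is the capture
lengths <- as.integer(str_match(info, "LENGTH=(\\d+)$")[, 2])
lengths
# 1688 1623
```

If only the strand word itself is wanted, str_extract(info, "FORWARD|REVERSE") would give it directly; the locate-and-subset route is useful when the position is needed as an index.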
Finally, the code creates the results data frame by combining the extracted and subsetted variables, converting the chromosome, coordinate, and length columns to integers with as.integer() so that each column has an appropriate type.
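To reuse the recipe on other header vectors, the steps could be gathered into a single function. The sketch below is one possible arrangement, not part of stringr or rbioinfcookbook: parse_headers() is a hypothetical helper name, the input headers are made up, and the function assumes every header has at least four "|"-separated fields.

```r
library(stringr)

# Hypothetical helper that bundles the recipe's steps
parse_headers <- function(x) {
  fields <- str_split(x, "\\|", simplify = TRUE)
  info <- str_trim(fields[, 4])
  chr <- str_match(info, "chr(\\d):(\\d+)-(\\d+)")
  strand_start <- str_locate(info, "FORWARD|REVERSE")[, "start"]

  data.frame(
    ids = str_extract(x, "^AT\\dG.*\\.\\d"),
    description = str_trim(fields[, 3]),
    chromosome = as.integer(chr[, 2]),
    start = as.integer(chr[, 3]),
    end = as.integer(chr[, 4]),
    strand = str_sub(info, strand_start, strand_start),
    length = as.integer(str_match(info, "LENGTH=(\\d+)$")[, 2])
  )
}

# Made-up headers for illustration
headers <- c(
  "AT1G01010.1 | Symbols: NAC001 | NAC domain containing protein 1 | chr1:3631-5899 FORWARD LENGTH=1688",
  "AT1G01020.1 | Symbols: ARV1 | Arv1-like protein | chr1:5928-8737 REVERSE LENGTH=1623"
)

results <- parse_headers(headers)

# Check that the numeric columns really are integers
sapply(results, class)
```

Calling sapply(results, class) or str(results) after building the data frame is a quick way to confirm that the coordinate and length columns were converted rather than left as text.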