Performing multiple alignments of proteins or genes
Sequence alignment, whether as a preliminary step before building phylogenetic trees or as an end in itself to identify conserved and divergent regions, is a mainstay of bioinformatics analysis. It is well supported in R by the ape package and in Bioconductor by the msa and DECIPHER packages. In this recipe, we’ll look at the straightforward procedures for going from sequences to an alignment.
There are different techniques for different sequence lengths. In the first part of this recipe, we’ll look at sequences of a couple of thousand residues or fewer, such as those that represent genes and proteins.
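To give a feel for how little code the short-sequence case needs, here is a minimal sketch using msa. It assumes the msa and Biostrings packages are already installed from Bioconductor. The three peptides and the names toy_seqs and alignment are made up for illustration and are not the recipe's dataset.

```r
# Assumption: msa and Biostrings are installed, e.g. via
# BiocManager::install(c("msa", "Biostrings"))
library(msa)
library(Biostrings)

# Three short, made-up peptide sequences (illustrative only, not real data)
toy_seqs <- AAStringSet(c(
  seq_a = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMF",
  seq_b = "MVLSGEDKSNIKAAWGKIGGHGAEYGAEALERMF",
  seq_c = "MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMF"
))

# Align with ClustalW; "ClustalOmega" and "Muscle" are the other
# values accepted by the method argument
alignment <- msa(toy_seqs, method = "ClustalW")

# Print every alignment column rather than the default shortened view
print(alignment, show = "complete")
```

Because the input is an AAStringSet, msa() can tell that the sequences are protein. If you pass a plain character vector instead, you would need to set the type argument yourself.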
Getting ready
For this recipe, you’ll need the msa package. This is a fairly hefty package that bundles external alignment software: ClustalW, Clustal Omega, and MUSCLE. The ape and seqinR packages are also needed. As a test dataset, we’ll use some hemoglobin protein sequences stored in the rbioinfcookbook
...