Novel feature detection in proteins
Sometimes we’ll have a list of protein sequences, from some analysis or experiment, that are biologically related in some way. We might wish to determine which parts of those proteins are responsible for the shared activity. Domain and motif finding, as we’ve done in the preceding recipes, is only helpful if we’ve seen the domains before or if the sequence is well conserved or statistically over-represented. A different approach is to try machine learning, in which we build a model that can classify our proteins accurately and then use the properties of that model to show us which parts of the proteins drive the classification. We’ll take that approach in this recipe by training and analyzing a support vector machine (SVM).
Getting ready
For this recipe, we’ll need the kebabs and Biostrings Bioconductor packages, as well as the e1071 and readr packages. We’ll also need two input data files that...