In this chapter, we will explore the world of bioinformatics. We will use Markov models, k-nearest neighbors, support vector machines, and other common classifiers to classify short E. coli DNA sequences. For this project, we will use a dataset from the UCI Machine Learning Repository that contains 106 DNA sequences, each 57 nucleotides long. You will learn how to import data from the UCI repository, convert text input to numerical data, build and train classification algorithms, and compare and contrast machine learning classification algorithms.
We will cover the following topics:
- Classifying DNA sequences
- Data preprocessing
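
As a preview of the text-to-numeric conversion mentioned above, here is a minimal sketch of one common approach, one-hot encoding, in which each nucleotide becomes four 0/1 values. The helper name `one_hot_encode` and the lowercase `acgt` alphabet are assumptions made for illustration; the chapter itself may use a different encoding or library.

```python
# Hypothetical helper for illustration only; not taken from the chapter.
NUCLEOTIDES = "acgt"  # assumed alphabet and column order


def one_hot_encode(sequence):
    """Return a flat list with four 0/1 entries per nucleotide."""
    encoded = []
    for base in sequence.lower():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected character: {base!r}")
        # Append a 1 in the position of the matching nucleotide, 0 elsewhere.
        encoded.extend(1 if base == candidate else 0 for candidate in NUCLEOTIDES)
    return encoded


features = one_hot_encode("tacg")
print(features)
```

Under this scheme, a 57-nucleotide sequence would yield 57 × 4 = 228 numeric features, which is the kind of fixed-width input that classifiers such as k-nearest neighbors or support vector machines expect.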