CNNs for genomics
Although CNNs are primarily used for unstructured data such as images, text, and audio, they are also powerful tools for non-image data such as DNA sequences. Raw DNA sequence data, however, cannot be fed to a CNN directly for feature extraction, because a sequence of nucleotide letters is not numeric. It must first be converted to a numerical representation, and a common way to do this is to convert the 1D DNA sequence to a one-hot encoded structure (Figure 5.8):
Figure 5.8 – Example of one-hot encoding for a DNA sequence
As shown in the preceding diagram, each nucleotide in the DNA sequence is represented as a one-hot vector: A = [1000], C = [0100], G = [0001], and T = [0010]. A sequence of length N therefore becomes an N x 4 matrix, which can then be fed into the model for training. Note that one-hot encoding is not the only way of representing DNA sequences for a CNN. There is also label encoding in which...
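The one-hot encoding described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation behind Figure 5.8: the function name `one_hot_encode` and the `ONE_HOT` lookup table are hypothetical, the nucleotide-to-vector mapping is copied from the text, and rejecting characters other than A, C, G, and T is an assumption made here for simplicity.

```python
# Nucleotide-to-vector mapping as given in the text. The column order is
# arbitrary, provided the same mapping is applied to every sequence.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 0, 1],
    "T": [0, 0, 1, 0],
}


def one_hot_encode(sequence):
    """Return one 4-element one-hot row per nucleotide in `sequence`."""
    encoded = []
    for base in sequence.upper():
        if base not in ONE_HOT:
            # Assumption: anything other than A, C, G, T (for example N)
            # is rejected rather than silently encoded.
            raise ValueError(f"Unsupported nucleotide: {base!r}")
        # Copy the row so callers cannot mutate the lookup table.
        encoded.append(list(ONE_HOT[base]))
    return encoded


matrix = one_hot_encode("ACGT")
for base, row in zip("ACGT", matrix):
    print(base, row)
```

The result has one row per nucleotide and four columns. In practice, a matrix like this would typically be converted to an array or tensor in whichever deep learning framework is used before being passed to a CNN.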