Applying ML to genomics
Before we dive into ML model details, let’s first understand genomic data, which is stored as DNA in every organism. DNA contains four chemical bases: adenine (A), thymine (T), cytosine (C), and guanine (G). They always bond in a particular manner: adenine pairs with thymine, and cytosine pairs with guanine. The order of these bases is what makes up a DNA sequence, written with the letters A, T, C, and G. For example, ACTCCACAGTACCTCCGAGA is a DNA sequence of length 20. A single complete sequence of the human genome is around 3 billion base pairs (bp) long and takes about 200 GB of data storage (https://www.science.org/doi/10.1126/science.abj6987).
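To make the pairing rule concrete, here is a minimal Python sketch that treats a DNA sequence as a plain string and derives the bases it would bond with. The names `PAIRS` and `complement` are our own illustrative choices, not part of any genomics library, and the sketch assumes the sequence contains only the four letters A, T, C, and G.

```python
# Base-pairing rule from the text: A bonds with T, and C bonds with G.
# PAIRS and complement() are hypothetical names used only for illustration.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA sequence."""
    seq = seq.upper()
    unknown = set(seq) - set(PAIRS)
    if unknown:
        # Assumption: anything outside A, T, C, G is treated as an error.
        raise ValueError(f"unexpected bases: {sorted(unknown)}")
    return "".join(PAIRS[base] for base in seq)


example = "ACTCCACAGTACCTCCGAGA"  # the 20-base example from the text
print(len(example))          # 20
print(complement(example))   # TGAGGTGTCATGGAGGCTCT
```

Because each base has exactly one partner, applying `complement` twice returns the original sequence, which is a quick sanity check when working with sequence data.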
However, to analyze DNA, we don’t need the complete human genome sequence. Usually, we analyze only a part of the human DNA; for example, to determine hair growth or skin growth, a lab technician will take a small section of human skin...