Preparing a dataset for analysis
Our starting point will be a VCF file (or equivalent) with calls made by a genotyper (the Genome Analysis Toolkit (GATK) in our case), including annotations. Because we will be filtering NGS data, we need reliable criteria for deciding whether the call at a site should be accepted. So, how do we get that information? Generally, we can't, but if we need to, there are three basic approaches:
- Using a more robust sequencing technology for comparison – for example, using Sanger sequencing to verify NGS datasets. This is cost-prohibitive and can only be done for a few loci.
- Sequencing closely related individuals, for example, two parents and their offspring. In this case, we use Mendelian inheritance rules to decide whether a certain call is acceptable or not: a child's genotype must be explainable by one allele from each parent. This was the strategy used by both the Human Genome Project and the Anopheles gambiae 1000 Genomes Project.
- Finally, we can use simulations. This setup is not only quite complex but also of dubious reliability...
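
To make the second approach (sequencing parents and offspring) more concrete, here is a minimal sketch of a Mendelian consistency check. It assumes diploid genotypes that have already been extracted from the VCF file as pairs of allele indices (0 for the reference allele, 1 for the first alternative, and so on), with no missing calls. The function names is_mendelian_consistent and mendelian_error_rate are hypothetical and chosen only for illustration; they are not part of GATK or of any library:

```python
from itertools import product


def is_mendelian_consistent(mother, father, child):
    """Return True if the child's genotype can be built from one
    maternal allele and one paternal allele.

    Genotypes are pairs of allele indices, e.g. (0, 1) for a
    heterozygous call. Assumption: diploid, no missing alleles.
    """
    child_sorted = tuple(sorted(child))
    # Try every combination of one allele from each parent
    for m_allele, f_allele in product(mother, father):
        if tuple(sorted((m_allele, f_allele))) == child_sorted:
            return True
    return False


def mendelian_error_rate(trio_calls):
    """Fraction of sites where the trio violates Mendelian inheritance.

    trio_calls is a list of (mother, father, child) genotype tuples,
    one entry per site.
    """
    if not trio_calls:
        raise ValueError("No sites to evaluate")
    errors = sum(
        1 for mother, father, child in trio_calls
        if not is_mendelian_consistent(mother, father, child)
    )
    return errors / len(trio_calls)


# Illustrative (made-up) genotypes for four sites
example_sites = [
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 0), (0, 0), (0, 1)),  # error: neither parent carries allele 1
    ((0, 1), (0, 1), (1, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # consistent
]
print(mendelian_error_rate(example_sites))
```

The idea is that sites (or ranges of annotation values) with a high proportion of Mendelian errors are likely to be poorly called, which gives us an empirical basis for choosing filtering criteria. A real analysis would also need to handle missing calls and sex chromosomes, which this sketch ignores.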