Preparing the dataset for analysis
Our starting point will be a VCF file (or equivalent) with calls made by a genotyper (the Genome Analysis Toolkit (GATK) in our case), including the annotations. As we will be filtering NGS data, we need reliable decision criteria to call a site. So, how do we get that information? Generally, we can't, but when it is really needed, there are three basic approaches:
- Using a more robust sequencing technology for comparison; for example, using Sanger sequencing to verify NGS datasets. This is cost-prohibitive and can only be done for a few loci.
- Sequencing closely related individuals, for example, two parents and their offspring. In this case, we use Mendelian inheritance rules to decide whether a certain call is acceptable. This was the strategy used by both the human and Anopheles 1000 Genomes projects.
- Finally, we can use simulations. This setup is not only quite complex, but also of dubious reliability. It's more of a theoretical option.
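To make the second approach more concrete, here is a minimal sketch of a Mendelian consistency check for a single trio. It assumes diploid, autosomal sites with no missing calls, and genotypes encoded as pairs of allele indices (0 for the reference allele, 1 for the first alternate, as in the VCF GT field). The function name and the toy genotypes are hypothetical and only serve as an illustration; they are not taken from GATK or from any of the projects mentioned above.

```python
from itertools import product


def is_mendelian_consistent(mother, father, child):
    """Check whether a child's diploid genotype is compatible with its parents.

    Each genotype is a pair of allele indices, e.g. (0, 1) for a heterozygote.
    Assumes an autosomal, diploid site with no missing calls.
    """
    # Every genotype the child could inherit: one allele from each parent.
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


# Hypothetical toy calls for three sites: (mother, father, child)
trio_calls = [
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 0), (0, 0), (0, 1)),  # inconsistent: neither parent carries allele 1
    ((0, 1), (0, 1), (1, 1)),  # consistent
]

errors = [not is_mendelian_consistent(m, f, c) for m, f, c in trio_calls]
error_rate = sum(errors) / len(errors)
print(f"Mendelian errors: {sum(errors)} of {len(errors)} sites ({error_rate:.2f})")
```

Note that a check like this flags genuine de novo mutations as errors too, so the fraction of inconsistent sites is best read as a rough upper bound on the calling error rate rather than an exact measure.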
In this chapter, we will...