Population scale clustering and geographic ethnicity
Next-generation genome sequencing (NGS) reduces overhead and time for genomic sequencing, leading to big data production in an unprecedented way. In contrast, analyzing this large-scale data is computationally expensive and increasingly becomes the key bottleneck. This increase in NGS data in terms of number of samples overall and features per sample demands solutions for massively parallel data processing, which imposes extraordinary challenges on machine learning solutions and bioinformatics approaches. The use of genomic information in medical practice requires efficient analytical methodologies to cope with data from thousands of individuals and millions of their variants.
One of the most important tasks is the analysis of genomic profiles to attribute individuals to specific ethnic populations, or the analysis of nucleotide haplotypes for disease susceptibility. The data from the 1000 Genomes project serves as the prime source to analyze...