As I already stated, the 24 VCF files together contribute 820 GB of data. Therefore, I decided to use only the genetic variants of chromosome Y to make the demonstration clearer. That file is around 160 MB, which is small enough not to pose a serious computational challenge. You can download all the VCF files as well as the panel file from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/.
Let us get started. We begin by creating a SparkSession, the gateway to the Spark application:
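import org.apache.spark.sql.SparkSession

// Create a local Spark session that uses all available cores
val spark: SparkSession = SparkSession
  .builder()
  .appName("PopStrat")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "C:/Exp/") // change to a writable directory on your machine
  .getOrCreate()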
Then let's give Spark the paths of both the VCF file and the panel file:
val genotypeFile = "<path>/ALL.chrY.phase3_integrated_v2a.20130502.genotypes...