Reading genomics data with Zarr
Zarr (https://zarr.readthedocs.io/en/stable/) stores array-based data, such as NumPy arrays, in a hierarchical structure on disk and in cloud storage. The data structures that Zarr uses to represent arrays are not only very compact but also allow parallel reading and writing, which we will see in the next recipes. In this recipe, we will read and process genomics data from the Anopheles gambiae 1000 Genomes project (https://malariagen.github.io/vector-data/ag3/download.html). Here, we will use simple sequential processing to ease the introduction to Zarr; in the following recipe, we will do parallel processing. Our project will compute the missingness (how many samples lack a genotype call) at every genomic position sequenced on a single chromosome.
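To make the goal concrete before touching the real data, here is a minimal sketch of a sequential, chunk-by-chunk missingness count. It is a pure-Python stand-in, not the recipe's code: the toy `calls` table, the `missingness_by_position` helper, and the `chunk_size` parameter are all hypothetical, and it assumes that a missing genotype call is encoded as -1. With Zarr, slicing the stored array would take the place of the list slice.

```python
# Toy genotype calls: one row per genomic position, one entry per sample.
# Assumption: a missing call is encoded as -1.
MISSING = -1

calls = [
    [0, 1, -1, 0],
    [-1, -1, 0, 1],
    [0, 0, 0, 0],
    [1, -1, -1, -1],
    [0, 1, 1, 0],
]


def missingness_by_position(calls, chunk_size=2):
    """Count missing calls per position, reading one chunk of rows at a time."""
    counts = []
    for start in range(0, len(calls), chunk_size):
        # With a Zarr array, this slice would load only these rows from storage.
        chunk = calls[start:start + chunk_size]
        for row in chunk:
            counts.append(sum(1 for call in row if call == MISSING))
    return counts


missing_counts = missingness_by_position(calls)
print(missing_counts)  # [1, 2, 0, 3, 0]
```

Processing fixed-size chunks keeps memory use bounded no matter how long the chromosome is, and the result does not depend on the chunk size chosen.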
Getting ready
The Anopheles 1000 Genomes data is available from Google Cloud Platform (GCP). To download data from GCP, you will need gsutil, available from https://cloud.google.com/storage/docs/gsutil_install. After...