Using high-performance data formats – Parquet
In the previous recipes, we used HDF5 as a storage format for genomic data. In this recipe, we will consider another format: Parquet (https://parquet.apache.org/), an Apache project. There are not, as far as I know, many use cases of Parquet in bioinformatics, but there are several reasons why this format should be considered. For one, it can be used natively with Apache Spark (see the next recipe). It can also be smarter than HDF5 about how data is stored: because Parquet is a columnar format, a reader can load only the columns it needs, which can, for example, make indexing into the data faster.
In this recipe, we will convert a subset of the HDF5 file that we used in the previous two recipes to Parquet.
Getting ready
You will need to download the same dataset as in the previous two recipes. At the very least, we recommend that you browse the HDF5 dataset (see the Getting ready section of the first recipe). There is no need to get acquainted with the rest of the code.
We will use Dask-native support for Parquet conversion...