Doing parallel computing with Dask
The previous code is still quite slow, so we will now use parallel processing to accelerate our data analysis. Our first approach uses Dask, a Python library that provides scalable parallelism: most code that runs in parallel on your laptop will also be able to scale to a large cluster. Dask is a fairly low-level, Python-specific approach. Later in this chapter, we will discuss an alternative that is higher-level and language-agnostic.
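To give a feel for how Dask expresses parallel work, here is a minimal sketch, assuming a small in-memory NumPy array rather than this recipe's dataset. The array size, chunk shape, and variable names are illustrative choices only. It wraps the array in a chunked Dask array, builds a lazy computation, and then executes it:

```python
import numpy as np
import dask.array as da

# Hypothetical toy data: a 1,000 x 1,000 array of ones (not the recipe's dataset).
data = np.ones((1000, 1000))

# Wrap the NumPy array in a Dask array split into 250 x 250 chunks.
lazy = da.from_array(data, chunks=(250, 250))

# Building the expression is lazy: no sum has been calculated at this point.
total = lazy.sum()

# compute() executes the task graph, processing the chunks in parallel.
result = total.compute()

print(lazy.numblocks, result)
```

Because each chunk can be processed independently, the same expression can be handed to a different scheduler (threads, processes, or a cluster) without rewriting the computation itself.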
Getting ready
We will make a parallel version of the previous code, so you will need to have the same dataset available. We will again be processing HDF5 data, so you should be familiar with the previous recipe.
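As a rough illustration of what making a parallel version of some code involves in general, the following sketch splits a range of work into chunks and maps those chunks over a pool of workers, using only the standard library. The function names, the chunk count, and the toy task (counting even numbers) are hypothetical and unrelated to the recipe's dataset. ThreadPool is used here only so that the sketch runs without a `__main__` guard; the process-based Pool exposes the same map interface:

```python
from math import ceil
from multiprocessing.pool import ThreadPool


def split_ranges(n_items, n_chunks):
    """Split range(n_items) into at most n_chunks contiguous (start, end) pairs."""
    size = ceil(n_items / n_chunks)
    return [(start, min(start + size, n_items))
            for start in range(0, n_items, size)]


def count_even(bounds):
    """Toy per-chunk task: count the even integers in [start, end)."""
    start, end = bounds
    return sum(1 for i in range(start, end) if i % 2 == 0)


# Split 10 items into 4 chunks; the last chunk may be shorter than the others.
ranges = split_ranges(10, 4)

# Each worker handles one chunk; map() returns results in the order of the inputs.
with ThreadPool(4) as pool:
    partial_counts = pool.map(count_even, ranges)

# Combine the per-chunk results into the final answer.
total_even = sum(partial_counts)
print(ranges, partial_counts, total_even)
```

The pattern is always the same: partition the input, process each partition independently, then combine the partial results.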
How to do it...
Take a look at the following steps:
- We will start by doing the necessary imports and checking Dask's version:
from multiprocessing.pool import Pool
from math import ceil
import numpy as np
import h5py
import dask
import dask.array as da
import dask.multiprocessing
print(dask...