Scheduling tasks with dask.distributed
Dask is extremely flexible in terms of execution: we can execute locally, on a scientific cluster, or on the cloud. That flexibility comes at a cost: it needs to be parameterized. There are several ways to configure Dask scheduling and execution, but the most generic is dask.distributed, as it is able to manage different kinds of infrastructure. Because I cannot assume you have access to a cluster or a cloud such as Amazon Web Services (AWS) or GCP, we will be setting up computation on your local machine, but remember that you can set up dask.distributed on very different kinds of platforms.
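To make the idea of a local setup concrete, here is a minimal sketch of starting a dask.distributed cluster on a single machine and running a small computation on it. The worker counts, the use of thread-based workers (processes=False), the disabled dashboard, and the helper name mean_on_local_cluster are all illustrative assumptions, not necessarily the configuration used in this recipe:

```python
import dask.array as da
from dask.distributed import Client, LocalCluster


def mean_on_local_cluster():
    """Hypothetical helper: compute a mean on a throwaway local cluster."""
    # Assumed settings: two single-threaded workers living in this process.
    # processes=False avoids spawning subprocesses, and
    # dashboard_address=None disables the diagnostics dashboard.
    with LocalCluster(
        n_workers=2,
        threads_per_worker=1,
        processes=False,
        dashboard_address=None,
    ) as cluster:
        # Once a Client is connected, Dask computations are sent to it.
        with Client(cluster):
            # A small in-memory array stands in for real data here.
            data = da.ones((1000, 1000), chunks=(250, 250))
            return float(data.mean().compute())


result = mean_on_local_cluster()
print(result)
```

The same Client interface is used when the cluster lives elsewhere; in that case you would typically pass the address of a running scheduler instead of creating a LocalCluster.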
Here, we will again compute simple statistics over variants of the Anopheles 1000 Genomes project.
Getting ready
Before we start with dask.distributed, we should note that Dask has a default scheduler that actually can change depending on the library you are targeting. For example, here is the scheduler for our NumPy example:
import dask...
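Separately from the recipe's own data, the following is a minimal sketch of how the default scheduler can differ between Dask collections. It uses small in-memory stand-ins (arr and bag are hypothetical names) and assumes that no scheduler has been set in the Dask configuration and that no dask.distributed client is active, since either would take precedence:

```python
import dask.array as da
import dask.bag as db
from dask.base import get_scheduler

# Small illustrative collections; these are not the recipe's data.
arr = da.ones((100, 100), chunks=(50, 50))
bag = db.from_sequence(range(10), npartitions=2)

# get_scheduler returns the function Dask would use to execute
# the given collections; its module name identifies the scheduler.
array_scheduler = get_scheduler(collections=[arr]).__module__
bag_scheduler = get_scheduler(collections=[bag]).__module__

print(array_scheduler)
print(bag_scheduler)
```

Under those assumptions, a Dask array defaults to the threaded scheduler while a Dask bag defaults to the multiprocessing one, which is why it is worth checking before moving to dask.distributed.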