Distributed learning with LightGBM and Dask
Dask is an open-source Python library for distributed computing. It’s designed to integrate seamlessly with existing Python libraries and tools, including scikit-learn and LightGBM. This section looks at running distributed training workloads for LightGBM using Dask.
Dask (https://www.dask.org/) lets you set up clusters on a single machine or across many machines. Running Dask on a single machine is the default and requires no setup, and workloads written for a single-machine cluster (or scheduler) can readily be moved to a distributed scheduler.
Dask offers many ways to run a distributed cluster, including integration with Kubernetes and MPI, or automatic provisioning on a hyperscaler such as AWS or Google Cloud Platform.
When running on a single machine, Dask still parallelizes the workload across multiple threads (or processes, depending on the scheduler), which can significantly speed up computation.
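As a rough illustration of the single-machine case, the sketch below builds a small task graph with `dask.delayed` and runs it on the local threaded scheduler. The `square` function and the input range are hypothetical and chosen only for illustration; they are unrelated to LightGBM training.

```python
from dask import delayed


@delayed
def square(x):
    # Each call becomes one task in the Dask task graph.
    return x * x


# Build the graph lazily; nothing is computed at this point.
tasks = [square(i) for i in range(5)]
total = delayed(sum)(tasks)

# Execute the graph on the local threaded scheduler (no cluster setup needed).
result = total.compute(scheduler="threads")
print(result)  # 0 + 1 + 4 + 9 + 16 = 30
```

The same graph should run on a distributed scheduler with little change: if the explicit `scheduler` argument is dropped and a `dask.distributed.Client` is created beforehand, `compute()` submits the tasks to that client's cluster instead.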
Dask provides cluster management utility classes to...