Introduction to the Dask library
You can think of Dask as one of the most revolutionary Python libraries for data processing at scale. If you are a regular pandas and NumPy user, you'll love Dask. The library allows you to work with data NumPy and pandas doesn't allow because they don't fit into the RAM.
Dask supports both NumPy array and pandas DataFrame data structures, so you'll quickly get up to speed with it. It can run either on your computer or a cluster, making it that much easier to scale. You only need to write the code once and then choose the environment that you'll run it in. It's that simple.
One other thing to note is that Dask allows you to run code in parallel with minimal changes. As you saw earlier, processing things in parallel means the execution time decreases, which is generally the behavior we want. Later, you'll learn how parallelism in Dask works with dask.delayed
.
To get started, you'll have to install the...