Improving apply performance with Dask, Pandarell, Swifter, and more
Sometimes .apply
is convenient. Various libraries enable parallelizing such operations. There are various mechanisms to do this. The easiest is to try and leverage vectorization. Math operations are vectorized in pandas, if you add a number (say 5) to a numerical series, pandas will not add 5 to each value. Rather it will leverage a feature of modern CPUs to do the operation one time.
If you cannot vectorize, as is the case with our limit_countries
function, you have other options. This section will show a few of them.
Note that you will need to install these libraries as they are not included with pandas.
The examples show limiting values in the country column from the survey data to a few values.
How to do it…
- Import and initialize the Pandarallel library. This library tries to parallelize pandas operations across all available CPUs. Note that this library runs fine on Linux and...