Process data in parallel
The processing presented in the previous recipe works well, but it handles the files one by one. That may be fine for a small number of files, but with a huge number of files to process it becomes inefficient: at any moment only a single CPU core is doing work, which is far from ideal for this kind of number-crunching task.
In this recipe, we will see how to process the files in parallel, making use of all of the computer's cores to speed up the process and greatly increase the throughput.
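As a sketch of the general idea (not the recipe's final code), the standard library's concurrent.futures module can spread a per-file function across all available cores; process_file here is a hypothetical placeholder that just counts lines, and the glob pattern is an assumption:

```python
import concurrent.futures
import glob

def process_file(path):
    """Hypothetical worker: process one log file independently.

    The real recipe would parse each sale line; here we only
    count lines to keep the sketch self-contained.
    """
    with open(path) as f:
        return path, sum(1 for _ in f)

def process_all(paths):
    # One worker process per CPU core by default; each file is
    # handled independently, so the work parallelizes cleanly.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        return list(executor.map(process_file, paths))

if __name__ == "__main__":
    for path, lines in process_all(glob.glob("*.log")):
        print(f"{path}: {lines} lines")
```

Because each file is processed with no shared state, results come back in input order from executor.map and no locking is needed.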
Getting ready
We will use the resulting CSV file from the previous recipe, which receives and transforms logs in the following format:
[<Timestamp>] - SALE - PRODUCT: <product id> - PRICE: <price>
Each line will represent a sale log.
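The recipe itself relies on the parse module for this; purely as an illustration of what extracting one sale line involves, a standard-library regular expression could look like the following (the exact timestamp and price formats are assumptions, and the sample line is made up):

```python
import re

# Matches one sale line of the form:
# [<Timestamp>] - SALE - PRODUCT: <product id> - PRICE: <price>
LOG_PATTERN = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] - SALE - "
    r"PRODUCT: (?P<product>\d+) - "
    r"PRICE: (?P<price>[\d.]+)"
)

def parse_sale_line(line):
    """Return (timestamp, product_id, price), or None if the line doesn't match."""
    match = LOG_PATTERN.match(line.strip())
    if not match:
        return None
    return (match.group("timestamp"),
            int(match.group("product")),
            float(match.group("price")))

sample = "[2018-05-05T10:58:41] - SALE - PRODUCT: 1345 - PRICE: 9.99"
print(parse_sale_line(sample))
```

Returning None for non-matching lines lets the caller skip malformed entries instead of raising mid-batch.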
We will use the parse module and the delorean module. We should install both modules, adding them to our requirements.txt file as follows:
$ echo "parse==1.14.0" >>...