A new simple yet powerful library called Pypeline was released last week for creating concurrent data pipelines in Python. Pypeline is designed for solving simple to medium-complexity data tasks that require concurrency and parallelism, in places where using frameworks such as Spark or Dask feels unnatural.
Pypeline offers a familiar, functional, and easy-to-use API. It lets you build data pipelines using Processes, Threads, and asyncio.Tasks via exactly the same API. With Pypeline, you also have control over the memory and CPU resources used at each stage of your pipeline.
Using Pypeline, you can easily create multi-stage data pipelines with the help of functions such as map, flat_map, and filter. To do so, you define a computational graph specifying the operations to be performed at each stage, the number of resources, and the type of workers you want to use. Pypeline comes with three main modules, each of which uses a different type of worker: processes, threads, and tasks, as shown in the sketch below.
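As a minimal sketch, the three modules can be imported side by side; the import name pypeln is an assumption based on the project's package name, while the module names come from the announcement:

```python
# Sketch only: the "pypeln" import path is an assumption;
# the three module names are those described in the article.
from pypeln import process as pr        # multiprocessing.Process workers
from pypeln import thread as th         # threading.Thread workers
from pypeln import asyncio_task as io   # asyncio.Task workers
```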
You can create a pipeline based on multiprocessing.Process workers with the help of the process module. You can then specify the number of workers at each stage, and the maxsize parameter limits the maximum number of elements that a stage can hold at any given time.
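The sketch below combines the process module with map and filter to build a two-stage pipeline. It assumes the pypeln import path from the previous sketch; the slow_add1 and slow_gt3 helpers are illustrative stand-ins for real work:

```python
import time
from random import random

from pypeln import process as pr  # assumed import path for the process module

def slow_add1(x):
    time.sleep(random())  # simulate a slow computation
    return x + 1

def slow_gt3(x):
    time.sleep(random())
    return x > 3

data = range(10)

# Each stage declares its own worker count; maxsize bounds how many
# elements the stage may hold at once, which caps memory usage.
stage = pr.map(slow_add1, data, workers=3, maxsize=4)
stage = pr.filter(slow_gt3, stage, workers=2)

result = list(stage)  # iterating the final stage runs the pipeline
print(result)         # e.g. [5, 4, 6, 8, 7, 10, 9] (order is non-deterministic)
```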
Similarly, you can create a pipeline based on threading.Thread workers by using the thread module, and a pipeline based on asyncio.Task workers by using the asyncio_task module.
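Since the API is shared across the three modules, switching worker types is mostly a matter of changing the import. The sketch below uses the thread module with the same assumed pypeln import path; the fetch_length helper is an illustrative stand-in for I/O-bound work, and the commented lines show how the asyncio_task module would slot into the same call:

```python
from pypeln import thread as th          # threading.Thread workers (assumed import path)
# from pypeln import asyncio_task as io  # asyncio.Task workers share the same API

def fetch_length(x):
    # Illustrative stand-in for I/O-bound work such as a network call.
    return len(str(x))

data = range(10)

# Thread-based pipeline: well suited to I/O-bound functions.
stage = th.map(fetch_length, data, workers=4, maxsize=8)
print(list(stage))

# The asyncio.Task version would keep the exact same shape:
# stage = io.map(fetch_length, data, workers=4, maxsize=8)
```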
Apart from composing multi-stage data pipelines through nested function calls, Pypeline also lets you build pipelines with the help of the pipe | operator.
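Here is the earlier two-stage process pipeline rewritten with the pipe operator; as before, the pypeln import path is assumed and the helpers are illustrative:

```python
import time
from random import random

from pypeln import process as pr  # assumed import path, as in the earlier sketches

def slow_add1(x):
    time.sleep(random())  # illustrative slow work
    return x + 1

def slow_gt3(x):
    time.sleep(random())
    return x > 3

# The same two-stage pipeline expressed with the pipe (|) operator;
# piping the final stage into list runs the pipeline and collects the results.
result = (
    range(10)
    | pr.map(slow_add1, workers=3, maxsize=4)
    | pr.filter(slow_gt3, workers=2)
    | list
)
print(result)  # e.g. [4, 6, 5, 8, 7, 9, 10] (order is non-deterministic)
```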
For more information, check out the official documentation.
How to build a real-time data pipeline for web developers – Part 1 [Tutorial]
How to build a real-time data pipeline for web developers – Part 2 [Tutorial]
Create machine learning pipelines using unsupervised AutoML [Tutorial]