Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon

Meet Pypeline, a simple python library for building concurrent data pipelines

Save for later
  • 2 min read
  • 25 Sep 2018

article-image

The Python team came out with a new simple and powerful library called Pypeline, last week for creating concurrent data pipelines. Pypeline has been designed for solving simple to medium data tasks that require concurrency and parallelism. It can be used in places where using frameworks such as Spark or Dask feel unnatural.

Pypeline comprises an easy to use familiar and functional API. It enables building data pipelines using Processes, Threads, and asyncio.Tasks via the exact same API. With Pypeline, you also have control over memory and CPU resources which are used at each stage of your pipeline.

Pypeline Basic Usage


Using Pypeline, you can easily create multi-stage data pipelines with the help of functions such as map, flat_map, filter, etc. To do so, you need to define a computational graph specifying the operations which are to be performed at each stage, the number of resources, and the type of workers you want to use. Pypeline comes with 3 main modules, and each of them uses a different type of worker. To build multi-stage data pipelines, you can use 3 type of workers, namely, processes, threads, and tasks.

Processes


You can create a pipeline based on multiprocessing. Process workers with the help of process module. After this, you can specify the numbers of workers at each stage. The maxsize parameter limits the maximum amount of elements that the stage can hold simultaneously.

Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at €18.99/month. Cancel anytime

Threads and Tasks


Create a pipeline using threading.Thread workers by using the thread module. Additionally, in order to create a pipeline based on asyncio.Task workers, use an asyncio_task module.

Apart from being used to create multi-stage data pipelines, it can also help you create pipelines with the help of the pipe | operator.

For more information, check out the official documentation.

How to build a real-time data pipeline for web developers – Part 1 [Tutorial]

How to build a real-time data pipeline for web developers – Part 2 [Tutorial]

Create machine learning pipelines using unsupervised AutoML [Tutorial]