Building Big Data Pipelines with Apache Beam: Use a single programming model for both batch and stream data processing

Product type: Paperback

Published: Jan 2022

Publisher: Packt

ISBN-13: 9781800564930

Length: 342 pages

Edition: 1st Edition

Languages: Python

Tools: Apache Beam

Concepts: Big Data

Author: Jan Lukavský


Table of Contents (13) Chapters

Preface

1. Section 1 Apache Beam: Essentials

2. Chapter 1: Introduction to Data Processing with Apache Beam

3. Chapter 2: Implementing, Testing, and Deploying Basic Pipelines

4. Chapter 3: Implementing Pipelines Using Stateful Processing

5. Section 2 Apache Beam: Toward Improving Usability

6. Chapter 4: Structuring Code for Reusability

7. Chapter 5: Using SQL for Pipeline Implementation

8. Chapter 6: Using Your Preferred Language with Portability

9. Section 3 Apache Beam: Advanced Concepts

10. Chapter 7: Extending Apache Beam's I/O Connectors

11. Chapter 8: Understanding How Runners Execute Pipelines

12. Other Books You May Enjoy

Introducing the primitive PTransform object – Partition

The GroupByKey transform creates a set of sub-streams based on a dynamic property of the data: the set of keys in a particular window can change while the pipeline is executing, and new keys can be created and processed at any time. This is the source of the complexity mentioned in the previous section, because we need to store our data in keyed states and flush them on triggers. A natural question follows: would the task be easier if we knew the exact set of keys upfront, at pipeline construction time?

The answer is yes, and that is why we have a PTransform object called Partition.
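To make the contrast concrete, here is a minimal sketch in plain Python. It does not use Apache Beam itself; the helper names `group_by_key`, `partition`, and `by_size`, as well as the sample data, are hypothetical and chosen only for illustration. The sketch assumes a partition function that receives an element together with the number of partitions and returns an index into a fixed list of outputs, which mirrors the general shape of Beam's Partition transform but is not its implementation.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Group values by key; the set of keys is only known once the data is seen."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def partition(elements, partition_fn, num_partitions):
    """Split elements into a fixed number of outputs chosen before any data is seen."""
    outputs = [[] for _ in range(num_partitions)]
    for element in elements:
        index = partition_fn(element, num_partitions)
        if not 0 <= index < num_partitions:
            raise ValueError(f"partition index {index} is out of range")
        outputs[index].append(element)
    return outputs


# Hypothetical sample data: (user, amount) pairs.
events = [("alice", 3), ("bob", 12), ("carol", 7), ("alice", 20)]


def by_size(event, num_partitions):
    """Route amounts below 10 to output 0 and everything else to output 1."""
    _, amount = event
    return 0 if amount < 10 else 1


# Exactly two outputs exist, whatever the data contains.
small, large = partition(events, by_size, 2)

# The number of groups depends entirely on which keys appear in the data.
grouped = group_by_key(events)
```

Because the number of outputs is fixed upfront, each output can be given a name and wired to its own downstream logic before any element arrives. With grouping by key, the number of groups is unknown until the data has been observed.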

Important note

A pipeline is generally divided into three phases during its life cycle: pipeline compile time, pipeline construction time, and pipeline execution time. Compile time refers (as usual) to the time we compile the source to bytecode. Construction time is the time when the pipeline's DAG of transformations...
