You're reading from Building Big Data Pipelines with Apache Beam Use a single programming model for both batch and stream data processing

Product type Paperback

Published in Jan 2022

Publisher Packt

ISBN-13 9781800564930

Length 342 pages

Edition 1st Edition

Languages

Python

Tools

Apache Beam

Concepts

Big Data

Author (1):

Lukavský

Table of Contents (13) Chapters

Preface

1. Section 1 Apache Beam: Essentials

2. Chapter 1: Introduction to Data Processing with Apache Beam

3. Chapter 2: Implementing, Testing, and Deploying Basic Pipelines

4. Chapter 3: Implementing Pipelines Using Stateful Processing

5. Section 2 Apache Beam: Toward Improving Usability

6. Chapter 4: Structuring Code for Reusability

7. Chapter 5: Using SQL for Pipeline Implementation

8. Chapter 6: Using Your Preferred Language with Portability

9. Section 3 Apache Beam: Advanced Concepts

10. Chapter 7: Extending Apache Beam's I/O Connectors

11. Chapter 8: Understanding How Runners Execute Pipelines

12. Other Books You May Enjoy

Why Apache Beam?

There are two basic questions we might ask when considering a new technology to learn and apply in practice:

What problem am I struggling with that the new technology can help me solve?
What would the costs associated with the technology be?

Every sound technology has a well-defined selling point – that is, something that justifies its existence in the presence of competing technologies. In the case of Beam, this selling point could be reduced to a single word: portability. Beam is portable on several layers:

Beam's pipelines are portable between multiple runners (a runner is a technology that executes the distributed computation described by a pipeline's author).
Beam's data processing model is portable between various programming languages.
Beam's data processing logic is portable between bounded and unbounded data.

Each of these points deserves a few words of explanation. By runner portability, we mean the ability to run an existing pipeline, written in one of the supported programming languages (for instance, Java, Python, Go, Scala, or even SQL), against a data processing engine that can be chosen at runtime. Typical examples of runners are Apache Flink, Apache Spark, and Google Cloud Dataflow. However, Beam is by no means limited to these; new runners are created as new technologies arise, and it's very likely that many more will be developed.
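To make the idea of choosing the execution engine at runtime more concrete, here is a minimal plain-Python sketch. It does not use the Beam API: the `pipeline` list, the two `run_*` functions, and the `RUNNERS` registry are hypothetical names invented for this illustration. The only point it tries to show is that the description of the computation stays the same while the code that executes it is swapped by name.

```python
from concurrent.futures import ThreadPoolExecutor

# A "pipeline" here is just an ordered list of per-element functions.
# This is an illustration of the idea only, NOT the Apache Beam API.
pipeline = [str.strip, str.upper]


def apply_steps(element, steps):
    """Run one element through every step of the pipeline, in order."""
    for step in steps:
        element = step(element)
    return element


def run_sequential(steps, data):
    """Hypothetical 'runner' that processes elements one by one."""
    return [apply_steps(x, steps) for x in data]


def run_threaded(steps, data, workers=4):
    """Hypothetical 'runner' that spreads elements over a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input.
        return list(pool.map(lambda x: apply_steps(x, steps), data))


# The registry maps a runner name to the function that executes a pipeline.
RUNNERS = {"sequential": run_sequential, "threaded": run_threaded}


def run(runner_name, steps, data):
    """Pick the execution engine by name at run time."""
    return RUNNERS[runner_name](steps, data)


words = ["  Beam ", "Flink", " Spark  "]
for name in RUNNERS:
    print(name, run(name, pipeline, words))
```

Both hypothetical runners produce the same result for the same pipeline, which is the property that runner portability is meant to give us, here on a toy scale.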

When we say Beam's data processing model is portable between various programming languages, we mean that Beam can support multiple SDKs regardless of the language or technology used by the runner. This way, we can write a Beam pipeline in Go and then run it on the Apache Flink runner, which is written in Java.

Last but not least, the core of Apache Beam's model is designed to be portable between bounded and unbounded data. Processing bounded data is what was historically called batch processing, while processing unbounded data corresponds to real-time processing, that is, an application crunching live data as it arrives in the system and producing low-latency output.
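The distinction between bounded and unbounded input can be illustrated with a small plain-Python sketch. Again, this is not Beam code and is much simpler than Beam's actual model; the function names and the `endless_source` generator are assumptions made up for the example. It only shows one piece of logic, a running word count, applied unchanged to a finite list and to a source that never ends.

```python
from collections import Counter
from itertools import islice


def running_word_counts(lines):
    """Yield the word counts seen so far after each input line.

    The logic only assumes that `lines` is iterable, so it works the same
    way for a finite list and for a generator that never ends.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
        yield dict(counts)


def endless_source():
    """Hypothetical unbounded source: repeats two lines forever."""
    while True:
        yield "beam flink"
        yield "beam spark"


# Bounded input: we can consume everything and keep the final result.
bounded = ["beam flink", "beam spark"]
final_counts = list(running_word_counts(bounded))[-1]

# Unbounded input: the same function, but we can only ever observe a prefix
# of the results, so we take the first three updates.
first_three = list(islice(running_word_counts(endless_source()), 3))

print(final_counts)
print(first_three)
```

With bounded data there is a natural moment when the answer is complete; with unbounded data there is not, so results have to be emitted incrementally as data arrives.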

Putting these pieces together, we can describe Beam as a tool that lets you approach your big data architecture with the following vision:

Choose your preferred language, write your data processing pipeline, run this pipeline using a runner of your choice, and do all of this for both batch and real-time data at the same time.

Because everything comes at a price, you should expect to pay for flexibility like this: the price is somewhat higher overhead in terms of CPU and/or memory usage. The Beam community works hard to make this overhead as small as possible, but chances are that it will never be zero.

If all of this sounds compelling to you, then we are ready to start a journey exploring Apache Beam!
