Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases now! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Building Big Data Pipelines with Apache Beam
Building Big Data Pipelines with Apache Beam

Building Big Data Pipelines with Apache Beam: Use a single programming model for both batch and stream data processing

eBook
NZ$39.99 NZ$57.99
Paperback
NZ$71.99
Subscription
Free Trial

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
Table of content icon View table of contents Preview book icon Preview Book

Building Big Data Pipelines with Apache Beam

Chapter 1: Introduction to Data Processing with Apache Beam

Data. Big data. Real-time data. Data streams. Many buzzwords to describe many things, and yet they have many common properties. Mind-blowing applications can be developed from the successful application of (theoretically) simple logic – take data and produce knowledge. However, a simple-sounding task can turn out to be difficult when the amount of data needed to produce knowledge is huge (and still growing). Given the vast volumes of data produced by humanity every day, which tools should we choose to turn our simple logic into scalable solutions? That is, solutions that protect our investment in creating the data extraction logic, even in the presence of new requirements arising or changing on a daily basis, and new data processing technologies being created? This book focuses on why Apache Beam might be a good solution to these challenges, and it will guide you through the Beam learning process.

In this chapter, we will cover the following topics:

  • Why Apache Beam?
  • Writing your first pipeline
  • Running a pipeline against streaming data
  • Exploring the key properties of Unbounded data
  • Measuring the event time progress inside data streams
  • Assigning data to windows
  • Unifying batch and streaming data processing

Technical requirements

In this chapter, we will introduce some elementary pipelines written using Beam's Java Software Development Kit (SDK).

We will use the code located in the GitHub repository for this book: https://github.com/PacktPublishing/Building-Big-Data-Pipelines-with-Apache-Beam.

We will also need the following tools to be installed:

  • Java Development Kit (JDK) 11 (possibly OpenJDK 11), with JAVA_HOME set appropriately
  • Git
  • Bash

    Important note

    Although it is possible to run many tools in this book using the Windows shell, we will focus on using Bash scripting only. We hope Windows users will be able to run Bash using virtualization or Windows Subsystem for Linux (or any similar technology).

First of all, we need to clone the repository:

  1. To do this, we create a suitable directory, and then we run the following command:
    $ git clone https://github.com/PacktPublishing/Building-Big-Data-Pipelines-with-Apache-Beam.git
  2. This will result in a directory, Building-Big-Data-Pipelines-with-Apache-Beam, being created in the working directory. We then run the following command in this newly created directory:
    $ ./mvnw clean install 

Throughout this book, the $ character will denote a Bash shell. Therefore, $ ./mvnw clean install would mean to run the ./mvnw command in the top-level directory of the git clone (that is, Building-Big-Data-Pipelines-with-Apache-Beam). By using chapter1$ ../mvnw clean install, we mean to run the specified command in the subdirectory called chapter1.

Why Apache Beam?

There are two basic questions we might ask when considering a new technology to learn and apply in practice:

  • What problem am I struggling with that the new technology can help me solve?
  • What would the costs associated with the technology be?

Every sound technology has a well-defined selling point – that is, something that justifies its existence in the presence of competing technologies. In the case of Beam, this selling point could be reduced to a single word: portability. Beam is portable on several layers:

  • Beam's pipelines are portable between multiple runners (that is, a technology that executes the distributed computation described by a pipeline's author).
  • Beam's data processing model is portable between various programming languages.
  • Beam's data processing logic is portable between bounded and unbounded data.

Each of these points deserves a few words of explanation. By runner portability, we mean the possibility to run existing pipelines written in one of the supported programming languages (for instance, Java, Python, Go, Scala, or even SQL) against a data processing engine that can be chosen at runtime. A typical example of a runner would be Apache Flink, Apache Spark, or Google Cloud Dataflow. However, Beam is by no means limited to these; new runners are created as new technologies arise, and it's very likely that many more will be developed.

When we say Beam's data processing model is portable between various programming languages, we mean it has the ability to provide support for multiple SDKs, regardless of the language or technology used by the runner. This way, we can code Beam pipelines in the Go language, and then run these against the Apache Flink Runner, written in Java.

Last but not least, the core of Apache Beam's model is designed so that it is portable between bounded and unbounded data. Bounded data is what was historically called batch processing, while unbounded data refers to real-time processing (that is, an application crunching live data as it arrives in the system and producing a low-latency output).

Putting these pieces together, we can describe Beam as a tool that lets you deal with your big data architecture with the following vision:

Choose your preferred language, write your data processing pipeline, run this pipeline using a runner of your choice, and do all of this for both batch and real-time data at the same time.

Because everything comes at a price, you should expect to pay for flexibility like this – this price would be a somewhat bigger overhead in terms of CPU and/or memory usage. The Beam community works hard to make this overhead as small as possible, but the chances are that it will never be zero.

If all of this sounds compelling to you, then we are ready to start a journey exploring Apache Beam!

Writing your first pipeline

Let's jump right into writing our first pipeline. The first part of this book will focus on Beam's Java SDK. We assume that you are familiar with programming in Java and building a project using Apache Maven (or any similar tool). The following code can be found in the com.packtpub.beam.chapter1.FirstPipeline class in the chapter1 module in the GitHub repository. We would like you to go through all of the code, but we will highlight the most important parts here:

  1. We need some (demo) input for our pipeline. We will read this input from the resource called lorem.txt. The code is standard Java, as follows:
    ClassLoader loader = FirstPipeline.class.getClassLoader();
    String file = loader.getResource("lorem.txt").getFile();
    List<String> lines = Files.readAllLines(
        Paths.get(file), StandardCharsets.UTF_8);
  2. Next, we need to create a Pipeline object, which is a container for a Directed Acyclic Graph (DAG) that represents the data transformations needed to produce output from input data:
    Pipeline pipeline = Pipeline.create();

    Important note

    There are multiple ways to create a pipeline, and this is the simplest. We will see different approaches to pipelines in Chapter 2, Implementing, Testing, and Deploying Basic Pipelines.

  3. After we create a pipeline, we can start filling it with data. In Beam, data is represented by a PCollection object. Each PCollection object (that is, parallel collection) can be imagined as a line (an edge) connecting two vertices (PTransforms, or parallel transforms) in the pipeline's DAG.
  4. Therefore, the following code creates the first node in the pipeline. The node is a transform that takes raw input from the list and creates a new PCollection:
    PCollection<String> input = pipeline.apply(Create.of(lines));

    Our DAG will then look like the following diagram:

    Figure 1.1 – A pipeline containing a single PTransform

    Figure 1.1 – A pipeline containing a single PTransform

  5. Each PTransform can have one main output and possibly multiple side output PCollections. Each PCollection has to be consumed by another PTransform or it might be excluded from the execution. As we can see, our main output (PCollection of PTransform, called Create) is not presently consumed by any PTransform. We connect PTransform to a PCollection by applying this PTransform on the PCollection. We do that by using the following code:
    PCollection<String> words = input.apply(Tokenize.of());

    This creates a new PTransform (Tokenize) and connects it to our input PCollection, as shown in the following figure:

    Figure 1.2 – A pipeline with two PTransforms

    Figure 1.2 – A pipeline with two PTransforms

    We'll skip the details of how the Tokenize PTransform is implemented for now (we will return to that in Chapter 5, Using SQL for Pipeline Implementation, which describes how to structure code in general). Currently, all we have to remember is that the Tokenize PTransform takes input lines of text and splits each line into words, which produces a new PCollection that contains all of the words from all the lines of the input PCollection.

  6. We finish the pipeline by adding two more PTransforms. One will produce the well-known word count example, so popular in every big data textbook. And the last one will simply print the output PCollection to standard output:
    PCollection<KV<String, Long>> result =
        words.apply(Count.perElement());
    result.apply(PrintElements.of());

    Details of both the Count PTransform (which is Beam's built-in PTransform) and PrintElements (which is a user-defined PTransform) will be discussed later. For now, if we focus on the pipeline construction process, we can see that our pipeline looks as follows:

    Figure 1.3 – The final word count pipeline

    Figure 1.3 – The final word count pipeline

  7. After we define this pipeline, we should run it. This is done with the following line:
    pipeline.run().waitUntilFinish();

    This causes the pipeline to be passed to a runner (configured in the pipeline; if omitted, it defaults to a runner available on Classpath). The standard default runner is the DirectRunner, which executes the pipeline in the local Java Virtual Machine (JVM) only. This runner is mostly only suitable for testing, as we will see in the next chapter.

  8. We can run this pipeline by executing the following command in the code examples for the chapter1 module, which will yield the expected output on standard output:
    chapter1$ ../mvnw exec:java \
        -Dexec.mainClass=com.packtpub.beam.chapter1.FirstPipeline

    Important note

    The ordering of output is not defined and is likely to vary over multiple runs. This is to be expected and is due to the fact that the pipeline underneath is executed in multiple threads.

  9. A very useful feature is that the application of PTransform to PCollection can be chained, so the preceding code can be simplified to the following:
    ClassLoader loader = ...
    FirstPipeline.class.getClassLoader();
    String file =
       loader.getResource("lorem.txt").getFile();
    List<String> lines = Files.readAllLines(
        Paths.get(file), StandardCharsets.UTF_8);
    Pipeline pipeline = Pipeline.create();
    pipeline.apply(Create.of(lines))
        .apply(Tokenize.of())
        .apply(Count.perElement())
        .apply(PrintElements.of());
    pipeline.run().waitUntilFinish();

    When used with care, this style greatly improves the readability of the code.

Now that we have written our first pipeline, let's see how to port it from a bounded data source to a streaming source!

Running our pipeline against streaming data

Let's discuss how we can change this code to enable it to run against a streaming data source. We first have to define what we mean by a data stream. A data stream is a continuous flow of data without any prior information about the cardinality of the dataset. The dataset can be either finite or infinite, but we do not know which in advance. Because of this property, the streaming data is often called unbounded data, because, as opposed to bounded data, no prior bounds regarding the cardinality of the dataset can be made.

The absence of bounds is one property that makes the processing of data streams trickier (the other is that bounded data sets can be viewed as static, while unbounded data is, by definition, changing over time). We'll investigate these properties later in this chapter, and we'll see how we can leverage them to define a Beam unified model for data processing.

For now, let's imagine our pipeline is given a source, which gives one line of text at a time but does not give any signal of how many more elements there are going to be. How do we need to change our data processing logic to extract information from such a source?

  1. We'll update our pipeline to use a streaming source. To do this, we need to change the way we created our input PCollection of lines coming from a List via Create PTransform to a streaming input. Beam has a utility for this called TestStream, which works as follows.

    Create a TestStream (a utility that emulates an unbounded data source). The TestStream needs a Coder (details of which will be skipped for now and will be discussed in Chapter 2, Implementing, Testing, and Deploying Basic Pipelines):

    TestStream.Builder<String> streamBuilder =
        TestStream.create(StringUtf8Coder.of());
  2. Next, we fill the TestStream with data. Note that we need a timestamp for each record so that the TestStream can emulate a real stream, which should have timestamps assigned for every input element:
    Instant now = Instant.now();
    // add all lines with timestamps to the TestStream
    List<TimestampedValue<String>> timestamped =
        IntStream.range(0, lines.size())
            .mapToObj(i -> TimestampedValue.of(
               lines.get(i), now.plus(i)))
            .collect(Collectors.toList());
    for (TimestampedValue<String> value : timestamped) {
      streamBuilder = streamBuilder.addElements(value);
    }
  3. Then, we will apply this to the pipeline:
    // create the unbounded PCollection from TestStream
    PCollection<String> input =
        pipeline.apply(streamBuilder.advanceWatermarkToInfinity());

    We encourage you to investigate the complete source code of the com.packtpub.beam.chapter1.MissingWindowPipeline class to make sure everything is properly understood in the preceding example.

  4. Next, we run the class with the following command:
    chapter1$ ../mvnw exec:java \
        -Dexec.mainClass=\
            com.packtpub.beam.chapter1.MissingWindowPipeline

    This will result in the following exception:

    java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.

    This is because we need a way to identify the (at least partial) completeness of the data. That is to say, the data needs (explicit or implicit) markers that define a condition that (when met) triggers a completion of a computation and outputs data from a PTransform. The computation can then continue from the values already computed or be reset to the initial state.

    There are multiple ways to define such a condition. One of them is to define time-constrained intervals called windows. A time-constrained window might be defined as data arriving within a specific time interval – for example, between 1 P.M. and 2 P.M.

  5. As the exception suggests, we need to define a window to be applied to the input data stream in order to complete the definition of the pipeline. The definition of a Window is somewhat complex, and we will dive into all its parameters later in this book. But for now, we'll define the following Window:
    PCollection<String> windowed =
        words.apply(
            Window.<String>into(new GlobalWindows())
                .discardingFiredPanes()
                .triggering(AfterWatermark.pastEndOfWindow()));

    This code applies Window.into PTransform by using GlobalWindows, which is a specific Window that contains whole data (which means that it can be viewed as a Window containing the whole history and future of the universe).

    The complete code can be viewed in the com.packtpub.beam.chapter1.FirstStreamingPipeline class.

  6. As usual, we can run this code using the following command:
    chapter1$ ../mvnw exec:java \
        -Dexec.mainClass=\
            com.packtpub.beam.chapter1.FirstStreamingPipeline

    This results in the same outcome as in the first example and with the same caveat – the order of output is not defined and will vary over multiple runs of the same code against the same data. The values will be absolutely deterministic, though.

Once we have successfully run our first streaming pipeline, let's dive into what exactly this streaming data is, and what to expect when we try to process it!

Exploring the key properties of unbounded data

In the previous section, we successfully ran our sample pipeline against simulated unbounded data. We have seen that only a slight modification had to be made for the pipeline to produce output in the streaming case. Let's now dive a little deeper into understanding why this modification was necessary and how to code our pipelines to be portable from the beginning.

First of all, we need to define a notion of time. In our everyday life, time is a common thing we don't think that much about. We know what time it is at the moment, and we react to events that happen (more or less) instantly. We can plan for the future, but we cannot change the past.

When it comes to data processing, things change significantly. Let's imagine a smart home application that reads data from various sensors and acts based on the values it receives. Such an application is depicted in the following diagram:

Figure 1.4 – A simple sensor data processing application

Figure 1.4 – A simple sensor data processing application

The application reads a stream of incoming sensor data, reads the state associated with each device and/or the other settings related to the data being processed, (possibly) updates the state, and (possibly) outputs some resulting events or commands (for example, turn on a light if some condition is met).

Now, let's imagine we want to make modifications to the application logic. We add some new smart features, and we would like to know how the logic would behave if it had been fed with some historical events that we stored for a purpose like this. We cannot simply exchange the logic and push historical data through it because that would result in incorrect modifications of the state – the state might have been changed from the time we recorded our historical data. We see that we cannot mix two times – the time at which we process our data and the time at which the data originated. We usually call these two times the processing time and the event time. The first one is the time we see on our clock when an event arrives, while the other is the time at which the event occurred. A beautiful demonstration of these two times is depicted in the following table:

Figure 1.5 – Star Wars episodes' processing and event times

Figure 1.5 – Star Wars episodes' processing and event times

For those who are not familiar with the Star Wars saga, the processing time here represents the order in which the movies were released, while the event time represents the order of the episodes in the chronology of the story. By defining the event time and the processing time, we are able to explain another weird aspect of the streaming world – each data stream is inevitably unordered in terms of its event time. What do we mean by this? And why should this be inevitable?

The out-of-orderness of a data stream is shown in the following diagram:

Figure 1.6 – Unordered data stream

Figure 1.6 – Unordered data stream

The circles represent data points (events), the x-axis is processing time, and the y-axis is event time. The upper-left half of the square should be empty under normal circumstances because that area represents events coming from the future – events with a higher event time than the current processing time. The rest of the data points represent events that arrive with a lower or higher delay from the time they occurred (event time). The vast majority of the delay is caused by technical reasons, such as queueing in the network stack, buffering, machine clocks being out of sync, or even outages of some parts of a distributed system. But there are also physical reasons why this happens – a vastly delayed data point is what you see if you look at the sky at night. The light coming from the stars we see with our naked eye is delayed by as much as a thousand years. Because even our physical reality works like this, the out-of-orderness is to be expected and has to be considered 'normal' in any stream processing.

So, we defined what the event and processing times are. We have a clock for measuring processing time. But what about event time? How do we measure that? Let's find out!

Measuring event time progress inside data streams

As we have shown, data streams are naturally unordered in terms of event time. Nevertheless, we need a way of measuring the completeness of our computation. Here is where another essential principle appears – watermarks.

A watermark is a (heuristic) algorithm that gives us an estimation of how far we have got in the event time domain. A perfect watermark gives the highest possible time (T) that guarantees no further data arrives with an event time < T. Let's demonstrate this with the following diagram:

Figure 1.7 – Watermark and late data

Figure 1.7 – Watermark and late data

We can see from Figure 1.7 that the watermark is a boundary that moves along with the data points, ideally leaving all of the data on its left side. All data points lying on the right side (with a processing time on the x-axis) are called late data. Such data typically requires special handling that will be described in Chapter 2, Implementing, Testing, and Deploying Basic Pipelines.

There are many ways to implement watermarks. They are typically generated at the source and propagated through the pipeline. We will discuss the details of the implementation of some watermarks in later chapters dedicated to I/O connectors. Typically, users do not have to generate watermarks themselves, although it is very useful to have a very good understanding of the concept.

States and triggers

Each computation on a data stream that takes into account more than a single isolated event needs a state. The state holds (accumulates) values derived from the so-far processed stream elements. Let's imagine we want to calculate the current number of elements in a stream. We would do that as follows:

Figure 1.8 – Counting elements in a stream

Figure 1.8 – Counting elements in a stream

The computational logic is straightforward – take the incoming element, read the current number of elements from the state, increment that number by one, store the new count into the state, and emit the current value to the output stream.

As simple as this sounds, the overall picture becomes quite complex when we consider that what we are building is an application that is supposed to run for a very long time (theoretically, forever). Such a long-running application will necessarily face disruptions caused by failing hardware, software, or necessary upgrades of the application itself. Therefore, each (sane) stream processing application has to be fault-tolerant by design.

Ensuring fault-tolerance puts specific requirements on the state and on the stream itself. Specifically, we must ensure the following:

  • We must keep the state in secure, fault-tolerant storage.
  • We must retain the ability to restore both the state and the stream to a defined position.

Both of these requirements dictate that every fault-tolerant stream processing engine must provide state management and state access APIs, and it must incorporate the state into its core concepts. The same holds true for Beam, and we'll dive deeper into the state concept in the following chapters.

Our element-count example raises another question: when should we output the resulting count? In the preceding example, we output the current count for each input element. This might not be adequate for every application. Other options would be to output the current value in the following ways:

  • In fixed periods of processing time (for instance, every 5 seconds)
  • When the watermark reaches a certain time (for instance, the output count when our watermark signals that we have processed all data up to 5 P.M.)
  • When a specific condition is met in the data

Such emitting conditions are called triggers. Each of these possibilities represents one option: a processing time trigger, an event time trigger, and a data-driven trigger. Beam provides full support for processing and event time triggers and supports one data-driven trigger, which is a trigger that outputs after a specific number of elements (for example, after every 10 elements).

If you remember, we have already seen a declaration of a trigger:

PCollection<String> windowed =
    words.apply(
        Window.<String>into(new GlobalWindows())
            .discardingFiredPanes()
            .triggering(
                AfterWatermark.pastEndOfWindow()));

This is one of many event time triggers, which specifies that we want to output a result when our watermark reaches a time, and it is defined as an end-of-window. We'll dive deeper into this in later chapters when we discuss the concept of windows.

Timers

Both event time and processing time triggers require an additional stream processing concept. This concept is a timer. A timer is a tool that lets an application specify a moment in either the processing time or event time domain, and when that moment is reached, an application-defined callback hook is called. For the same reason as with states, timers also need to be fault-tolerant (that is, they have to be kept in fault-tolerant storage). Beam is purposely designed so that there is actually no way to access a watermark directly, and the only way of observing a watermark is by using event time timers. We will investigate timers in more detail in Chapter 3, Implementing Pipelines Using Stateful Processing.

We now know that a streaming data processing engine needs to manage the application's state for us, but what is the life cycle of such a state? Let's find out!

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Understand how to improve usability and productivity when implementing Beam pipelines
  • Learn how to use stateful processing to implement complex use cases using Apache Beam
  • Implement, test, and run Apache Beam pipelines with the help of expert tips and techniques

Description

Apache Beam is an open source unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing. This book will help you to confidently build data processing pipelines with Apache Beam. You’ll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You’ll also learn how to test and run the pipelines efficiently. As you progress, you’ll explore how to structure your code for reusability and also use various Domain Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you’ll understand advanced Apache Beam concepts, such as implementing your own I/O connectors. By the end of this book, you’ll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.

Who is this book for?

This book is for data engineers, data scientists, and data analysts who want to learn how Apache Beam works. Intermediate-level knowledge of the Java programming language is assumed.

What you will learn

  • Understand the core concepts and architecture of Apache Beam
  • Implement stateless and stateful data processing pipelines
  • Use state and timers for processing real-time event processing
  • Structure your code for reusability
  • Use streaming SQL to process real-time data for increasing productivity and data accessibility
  • Run a pipeline using a portable runner and implement data processing using the Apache Beam Python SDK
  • Implement Apache Beam I/O connectors using the Splittable DoFn API

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jan 21, 2022
Length: 342 pages
Edition : 1st
Language : English
ISBN-13 : 9781800566569
Category :
Languages :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning

Product Details

Publication date : Jan 21, 2022
Length: 342 pages
Edition : 1st
Language : English
ISBN-13 : 9781800566569
Category :
Languages :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just NZ$7 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just NZ$7 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total NZ$ 284.97
The Machine Learning Solutions Architect Handbook
NZ$131.99
Hands-On Data Preprocessing in Python
NZ$80.99
Building Big Data Pipelines with Apache Beam
NZ$71.99
Total NZ$ 284.97 Stars icon

Table of Contents

12 Chapters
Section 1 Apache Beam: Essentials Chevron down icon Chevron up icon
Chapter 1: Introduction to Data Processing with Apache Beam Chevron down icon Chevron up icon
Chapter 2: Implementing, Testing, and Deploying Basic Pipelines Chevron down icon Chevron up icon
Chapter 3: Implementing Pipelines Using Stateful Processing Chevron down icon Chevron up icon
Section 2 Apache Beam: Toward Improving Usability Chevron down icon Chevron up icon
Chapter 4: Structuring Code for Reusability Chevron down icon Chevron up icon
Chapter 5: Using SQL for Pipeline Implementation Chevron down icon Chevron up icon
Chapter 6: Using Your Preferred Language with Portability Chevron down icon Chevron up icon
Section 3 Apache Beam: Advanced Concepts Chevron down icon Chevron up icon
Chapter 7: Extending Apache Beam's I/O Connectors Chevron down icon Chevron up icon
Chapter 8: Understanding How Runners Execute Pipelines Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Most Recent
Rating distribution
Full star icon Full star icon Full star icon Half star icon Empty star icon 3.7
(9 Ratings)
5 star 55.6%
4 star 11.1%
3 star 0%
2 star 11.1%
1 star 22.2%
Filter icon Filter
Most Recent

Filter reviews by




Paul Henry Sep 08, 2024
Full star icon Empty star icon Empty star icon Empty star icon Empty star icon 1
As most data engineers, I code on python. So I just assumed the book would have examples in python. I was disappointed to see the book only in java, so it almost completely useless to me. Reading several chapters, I noticed the material was also over technical and not well explained. For example, do I really need to know that a window contains state? It's pretty obvious by examples that would be the case.
Amazon Verified review Amazon
Rachel Jun 15, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
The book explains Apache Beam distributed data processing system for batch and streaming processing, from introducing learning examples to diving deep into advanced topics.I liked that book provides Github repo with all you need to focus on learning Apache Beam.The first two chapters introduce foundational concepts for pipelines, data transforms, batch, streaming, how to think of event and processing times, and windowed processing. Next, book covers stateful processing that can be used for processing data with external RPC services, using side inputs to enrich data. While the initial introduction is using Java, then the book expands to using Apache Beam with SQL, Python, and a concept of cross-language pipelines.Advanced topics include exploring building your custom I/O using splittable DoFn, understanding how Beam runners execute pipelines at scale, observability etc. - this all helps to deeper understand Apache Beam and apply that to help optimize data pipelines.
Amazon Verified review Amazon
David May 22, 2022
Full star icon Empty star icon Empty star icon Empty star icon Empty star icon 1
In general it's a useful book for learning Apache Beam. It contains a lot of information and some good walkthrough use cases that will show you how to solve some problems.But it definitely has some serious flaws.(1) If you don't already know the concepts of streaming well the book will confuse you because the explanations of these concepts are so short and bad that you will get lost.(2) Also, some of the code that is shown in the book is completely new to you and the way the author explains them don't help you to understand what the theory he was talking about has to do with the code.I had to go a step back and read a book on streaming concepts first to really understand what is going on in this book.The two star ratings comes from the fact that I think a book should be clear to understand and the writing should be so concise that you understand the content if you have the required knowledge that the book states. However, this was not the case here.
Amazon Verified review Amazon
Nelson J. Perez Mar 11, 2022
Full star icon Full star icon Empty star icon Empty star icon Empty star icon 2
The author it's definitely an expert on Apache Beam and other technologies that he uses in the exercises.But the book it's hard to swallow just like kale salad with no dressing.The problem is that he gives good examples but they are half baked in the book so you need to look at the source code in github. It's definitely a hands on book. But seems like he skim the basics like explaning the actual parameters and objects in Apache Beam.He introduces examples with unexplained code early on but does not explain quite well the code he just says go to the github repository to see the full code.Some of the github code doesn't work out of the box so you'll need to massage it a little to make work I'm still could not make chapter 6 code work so I had to disable it from the build and Docker file.Overall you need to read things twice or go back to understand previous examples. But you'll definitely need to look understand and play with the source code.I think he could have done a much better job explaining the basics which are missing in the book. A big plus is that it has lots of practical examples. Just not detailed enough explanations.Again he is an expert for sure but he skimmed through the basics sort of forgetting that the reader needs those to understand the rest of the book.
Amazon Verified review Amazon
Antonio Cachuan Feb 22, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
A great opportunity to increase your skill in doing Apache Beam pipelines. I think this book is a good pocket reference if you are consolidating your Beam knowledge. Let me share the pros and cons after reading it:Pros-The book starts by giving you an overview of all the requirements for setting up an environment, and all the examples (mainly using Java) are located in a repository. I was new to using Minikube but I was able to run the examples without major difficulties. -Continuous giving details about Beam concepts like PCollections, PTransform, and others. Here I have to admit that the explanation of the concept of Window helped me to finally clarify some personal doubts in this field.-I agree with how was presented the batch to the streaming world "batch is a special case of streaming" and the examples that followed support the author's idea.-All the chapters include examples and Jan explains line-by-line the main parts of his code.-In chapters 8 and 4, the author made a great work explaining all about Runners and PTransform respectively.Cons-I would like to get more advice for deploying a pipeline in production. For example, including tips and more troubleshooting or including a deployment sample in any cloud provider.-Not exactly a con, but you need some experience in Java, and some background developing pipelines.ConclusionsMy final thought is that if you want to go serious in Apache Beam purchasing this book is a no doubt decision.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.