Summary
In this chapter, we learned about all the remaining primitive transforms. We now know the details of both stateless and stateful ParDo objects. We know the basic life cycle of a DoFn and understand the concept of bundles. We understand why the input to a stateful ParDo has to be a keyed PCollection. We have seen how state and timers are managed by Beam and how they are delegated to runners to ensure fault tolerance. We know how a watermark propagates through transforms in general, and what a (stateful) transform's input watermark and output watermark are. We used this knowledge to create our own version of the GroupIntoBatches transform, which buffers data in state before sending it in batches to an external RPC service.
Next, we focused on handling late and droppable data in order to avoid data loss. We created one simple and one sophisticated version of a transform that filters (splits) data...