Introducing the primitive PTransform object – GroupByKey
As we have seen, a GroupByKey
transform works in the way illustrated in the following figure:
As in the case of Combine
PTransform
objects, the input stream must be keyed. This is a way of saying that the PCollection must have elements of the KV
type. This is generally true for any stateful operations. The reason for this is that having a state (which cannot be partitioned) divided into smaller, independent sub-states means that it cannot scale and would therefore lead to scalability issues. Therefore, Beam explicitly prohibits this and enforces the use of keyed PCollections for the input of each stateful operation.
The GroupByKey
transform then takes this keyed stream (in Figure 2.14, the key is represented as the shape of the stream element) and creates something that can be viewed as a sub-stream for each key. We can then process elements with a different...