Understanding how a runner handles state
As we already know, any complex computation will need to group multiple data elements in order to do computation. Because the streaming processing cannot rely on sources being able to replay data (as opposed to pure batch processing, where this property is essential), any updates to the local state during the computation have to be fault-tolerant, and it is the responsibility of a runner to ensure this. The Beam state API is designed precisely to enable this. Any state access is handled by a runner-provided implementation of StateInternals
(and TimerInternals
for timers – in this discussion, we will treat timers as special cases of state, so we will not describe them independently). The StateInternals
instances are responsible for creating the accessors for the state – for example, ValueState
, BagState
, MapState
, and so on. The runner must create and manage these instances to ensure both fault tolerance and consistency...