Using side outputs
As the name suggests, side inputs are data added to the main input from the side, while side outputs are data that a DoFn object emits outside of its main output PCollection. Let's start with side outputs, as they are more straightforward.
As an example, let's imagine we are processing data coming in as JSON values. We need to parse these messages into an internal object. But what should we do with the values that cannot be parsed because they contain a syntax error? If we do not validate messages before we store them in the stream (topic), we will sooner or later encounter such a value. We could silently drop those records, but that is not a great idea, as it could cause hard-to-debug problems. A much better option is to store these values on the side so that we can investigate and fix them. Therefore, we should aim to do the following:
Figure 3.8 – Main...
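To make the idea concrete before turning to the Beam API, here is a minimal sketch of the routing logic in plain Java. It deliberately does not use Beam, and every name in it (SideOutputSketch, Outputs, parseAll, and the parser parameter) is hypothetical and introduced only for illustration. Because the Java standard library has no JSON parser, the demo uses Integer::parseInt as a stand-in for a real parser that throws on malformed input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Plain-Java sketch (no Beam dependency): routes each raw record either to a
// "main" output of parsed values or to a "side" output of unparseable inputs.
final class SideOutputSketch {

  // Holds both outputs produced by one pass over the input.
  static final class Outputs<T> {
    final List<T> parsed = new ArrayList<>();           // main output
    final List<String> unparseable = new ArrayList<>(); // side output
  }

  // `parser` is a hypothetical stand-in for a JSON parser; it is assumed to
  // throw a RuntimeException when the input cannot be parsed.
  static <T> Outputs<T> parseAll(List<String> raw, Function<String, T> parser) {
    Outputs<T> out = new Outputs<>();
    for (String record : raw) {
      try {
        out.parsed.add(parser.apply(record));
      } catch (RuntimeException e) {
        // Keep the original text so it can be investigated and fixed later.
        out.unparseable.add(record);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Integer::parseInt stands in for a real JSON parser in this demo.
    Outputs<Integer> out = parseAll(List.of("1", "oops", "3"), Integer::parseInt);
    System.out.println("parsed: " + out.parsed);
    System.out.println("unparseable: " + out.unparseable);
  }
}
```

The important design choice is that a failed record is kept in its original form rather than dropped, so nothing is lost silently. In a Beam pipeline, the two lists would roughly correspond to two output PCollections produced by a single DoFn.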