Defining droppable data in Beam
This section briefly returns to the material covered in Chapter 2, Implementing, Testing, and Deploying Basic Pipelines, where we defined what late data means. To recap: late data is any data element whose timestamp is behind the watermark. In other words, the watermark tells us that we should no longer receive a data element with a timestamp lower than the watermark, yet such an element arrives anyway. This is perfectly fine; as described in Chapter 1, Introduction to Data Processing with Apache Beam, a perfect watermark would introduce unnecessary, or even impractical, latency. What we left unanswered, however, is the following question: what happens to data elements that arrive too late? We know that we can define allowed lateness, but what if data arrives even later than that? As always, the answer is: it depends. Luckily, some of the concepts relating to streaming...
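To make the recap concrete, the following minimal sketch classifies an element as on time, late, or droppable. It is plain Python, not Beam API code. It assumes fixed windows, integer timestamps in seconds, and a simplified rule that a window expires once the watermark reaches the window end plus the allowed lateness. The names `Status`, `FixedWindow`, `assign_fixed_window`, and `classify` are hypothetical and exist only for this illustration; what a real pipeline does with such elements depends on the windowing strategy and the runner.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ON_TIME = "on time"
    LATE = "late"
    DROPPABLE = "droppable"


@dataclass(frozen=True)
class FixedWindow:
    start: int  # inclusive, in seconds
    end: int    # exclusive, in seconds


def assign_fixed_window(timestamp: int, size: int) -> FixedWindow:
    """Assign a timestamp to the fixed window of the given size that contains it."""
    start = timestamp - timestamp % size
    return FixedWindow(start, start + size)


def classify(timestamp: int, watermark: int, window_size: int,
             allowed_lateness: int) -> Status:
    """Classify one element against the current watermark (illustrative model)."""
    # An element at or ahead of the watermark is not late.
    if timestamp >= watermark:
        return Status.ON_TIME
    window = assign_fixed_window(timestamp, window_size)
    # Assumption of this sketch: the window expires once the watermark
    # reaches window.end + allowed_lateness; elements arriving for an
    # expired window are considered droppable.
    if watermark >= window.end + allowed_lateness:
        return Status.DROPPABLE
    # Behind the watermark, but its window is still within allowed lateness.
    return Status.LATE


if __name__ == "__main__":
    for ts in (105, 70, 10):
        status = classify(ts, watermark=100, window_size=60, allowed_lateness=30)
        print(ts, status.value)
```

With a watermark of 100, a window size of 60, and an allowed lateness of 30, the elements with timestamps 105, 70, and 10 are classified as on time, late, and droppable, respectively. The element at 70 is behind the watermark, but its window [60, 120) has not yet expired, whereas the window [0, 60) of the element at 10 expired when the watermark reached 90.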