Stream-to-stream joins explained
Let's look at Figure 4.6 but modify it a little. Let's say that we want to get the results from the join as quickly as possible. Currently, the latency is defined by the length of the window – because the join is delegated on CoGroupBeyKey
, which, in turn, relies on GroupByKey
, we can only get results when a trigger that's associated with our window function fires. This typically happens at the end of the window (though it can happen sooner, which would then result in duplicates). If we want to avoid deduplication downstream and increase efficiency, because the duplicates can become a performance issue, we have no other option than to decrease the size of the window. At the limit, we end up with a situation like this:
The smaller we make our window, the less data we can join. If the window's duration is zero, will not be able to join any data at all. This...