Task 21 – Implementing our own splittable DoFn – a streaming file source
In this task, we will see how to implement all aspects of a splittable DoFn
process and we will see how to use its power and extensibility. So, let's create a streaming source from a plain filesystem! We will explain what we mean by that in the following problem definition.
The problem definition
We want to create a streaming-like source from a directory on a filesystem that will work by watching a specified directory for new files. Once a new file appears, it will grab it and output its content split into individual (text) lines for downstream processing. The source will compute a watermark as a maximal timestamp for all of the files in the specified directory. For simplicity, ignore recursive sub-directories and treat all files as immutable.
Let's illustrate that in the following discussion for clarity.
Discussing the problem decomposition
The problem effectively consists...