Summary
In this chapter, we walked through the last fundamental transform of Apache Beam: the splittable DoFn transform. It acts as a unifying bridge between batch and streaming sources, and it lets us build reusable bounded and unbounded transforms that can be composed to deliver new functionality. As an example, we implemented a StreamingFileRead transform that composes two splittable DoFn transforms: one that watches a directory for new files, and another that reads the contents of those files and produces PCollection objects of text lines from them. Note that we can reuse these transforms in different ways. The FileRead transform can consume filenames read from Apache Kafka, thereby converting a Kafka stream of new filenames into a stream of the text lines contained in those files. The DirectoryWatch transform could serve as the input to a transform that keeps files synchronized between two distinct locations. It is...