More and more big data applications rely on streaming data. There are many reasons for this, most notably the growing need for real-time insights, where a system must produce analytics on the fly as new data arrives. We will not spend a lot of time discussing the difference between batch and streaming data, but intuitively, batch data is at rest in a database or a file, whereas streaming data is, well, streaming from a source to a sink.
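To make the batch versus streaming intuition concrete, here is a minimal sketch in plain Python. It is not tied to any GCP service: the `sensor_feed` source and the sample readings are hypothetical and exist only for illustration. The batch function reads a complete dataset at rest, while the streaming function updates its result each time a new reading arrives.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch: the whole dataset is at rest, so it can be read in one pass."""
    return sum(readings) / len(readings)


def streaming_averages(source: Iterable[float]) -> Iterator[float]:
    """Streaming: emit an updated average as each new reading arrives."""
    count = 0
    total = 0.0
    for reading in source:
        count += 1
        total += reading
        yield total / count


def sensor_feed() -> Iterator[float]:
    """Hypothetical source that produces readings one at a time."""
    for value in (10.0, 20.0, 30.0, 40.0):
        yield value


# Batch: data at rest (illustrative values standing in for a file or table).
data_at_rest = [10.0, 20.0, 30.0, 40.0]
batch_result = batch_average(data_at_rest)

# Streaming: the same values, consumed one by one as they flow from the source.
stream_results = list(streaming_averages(sensor_feed()))

print(batch_result)
print(stream_results)
```

The batch version produces a single answer once all the data is available, whereas the streaming version produces a fresh answer after every event. Once the stream has delivered the same readings, its latest value matches the batch result.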
Google frequently references a specific architecture that combines batch and stream processing into a single pipeline. This architecture is worth understanding, and it works as follows:
In the GCP world, the most common batch data source is GCS (that is, buckets) and the reliable messaging layer is Pub/Sub. Pub/Sub almost always feeds into Dataflow, which is based on the Apache Beam...