Logging Dataflow pipelines
Dataflow pipelines provide stream and batch data processing capabilities at scale. GCP's Dataflow service is based on Apache Beam. Dataflow applications can stream logs at variable volumes in near real time.
Actions performed on GCP Dataflow are recorded by default in Cloud Logging and can be viewed through Logs Explorer. There, investigators can detect changes to Dataflow parameters or determine whether unauthorized users altered the pipeline.
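As a minimal sketch, the following Python snippet uses the google-cloud-logging client to pull Admin Activity audit log entries for the Dataflow service. The project ID is a placeholder, and the payload fields follow the standard Cloud Audit Logs schema; adjust both to the environment under investigation.

```python
# Requires: pip install google-cloud-logging
from google.cloud import logging

client = logging.Client(project="my-forensics-project")  # placeholder project ID

# Admin Activity audit log entries emitted by the Dataflow service,
# e.g. job creation, parameter updates, or cancellation.
log_filter = (
    'logName="projects/my-forensics-project/logs/cloudaudit.googleapis.com%2Factivity" '
    'AND protoPayload.serviceName="dataflow.googleapis.com"'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    payload = entry.payload or {}  # AuditLog payload parsed as a dict
    print(
        entry.timestamp,
        payload.get("authenticationInfo", {}).get("principalEmail"),
        payload.get("methodName"),
    )
```

Comparing the principalEmail and methodName fields across entries is a quick way to spot pipeline changes made by accounts that should not have Dataflow access.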
Note that Docker containers form the base of any Dataflow pipeline's operations. Investigators must therefore also examine the logs emitted by the GKE cluster and the GCE Instance Group Manager. GCP relies on the Instance Group Manager to create the managed VMs that run the containers, handling instance resourcing and VM deployment automatically.
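The same client can be pointed at this worker infrastructure. The sketch below assumes the standard gce_instance_group_manager and k8s_cluster monitored resource types, plus a placeholder project ID and time window; both should be adjusted to match the resources actually present in the project.

```python
from google.cloud import logging

client = logging.Client(project="my-forensics-project")  # placeholder project ID

# Logs from the managed instance group and GKE cluster backing the
# Dataflow workers, restricted to a placeholder investigation window.
worker_filter = (
    'resource.type=("gce_instance_group_manager" OR "k8s_cluster") '
    'AND timestamp>="2023-01-01T00:00:00Z"'
)

for entry in client.list_entries(filter_=worker_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.resource.type, entry.log_name)
```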
The following figure outlines some of the resources required for successful Dataflow pipeline execution. Like Syslog, Dataflow events are tagged...