Setting up Apache Beam and running a quick example
What is Apache Beam? According to beam.apache.org, Apache Beam is a unified programming model for implementing batch and streaming data processing jobs that can run on any execution engine.
Why Apache Beam? Because of the following points:
- UNIFIED: Use a single programming model for both batch and streaming use cases.
- PORTABLE: The runtime environment is decoupled from code. Execute pipelines on multiple execution environments, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
- EXTENSIBLE: Write and share new SDKs, IO connectors, and transformation libraries. You can also create your own runner to support a new runtime.
Beam model
Any transformation or aggregation performed in Beam is called a PTransform, and the dataset that connects these transforms is called a PCollection. A PCollection can be bounded (finite) or unbounded (infinite). One or more PTransforms and PCollections make up a pipeline in...