Exercise – Using Cloud Dataflow to stream data from Pub/Sub to GCS
In this exercise, we will learn how to develop Beam code in Python to create data pipelines. Learning Beam can be challenging at first because you need to get used to its specific coding pattern, so in this exercise we will start with HelloWorld-level code. The benefit of using Beam is that it is a general framework: you can create a batch or a streaming pipeline with largely the same code, and you can run the same pipeline on different runners. In this exercise, we will use the Direct Runner and Dataflow. In summary, here are the steps:
- Creating a HelloWorld application using Apache Beam
- Creating a Dataflow streaming job without aggregation
- Creating a Dataflow streaming job with aggregation
To start, you can check the code for this exercise here: https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform/blob/main/chapter-6/code/beam_helloworld.py
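The linked file contains the complete exercise code. To give a feel for the Beam coding pattern before you open it, here is a minimal sketch of a HelloWorld-level pipeline; this is an illustration rather than the book's exact code. With no runner specified, Beam executes it locally on the Direct Runner:

```python
import apache_beam as beam

# With no pipeline options, Beam falls back to the Direct Runner,
# which executes the pipeline locally on your machine.
with beam.Pipeline() as pipeline:
    (
        pipeline
        # Create a small in-memory PCollection to act as the input.
        | "Create words" >> beam.Create(["Hello", "World"])
        # Apply an element-wise transform, the basic Beam building block.
        | "Add greeting" >> beam.Map(lambda word: f"{word}!")
        # Print each element to stdout.
        | "Print" >> beam.Map(print)
    )
```

Notice the pattern: a pipeline object, a source, and a chain of named transforms joined with the `|` operator. Every Beam pipeline in this chapter follows this same shape.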
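For the two streaming steps, the same pattern extends to an unbounded source. The sketch below only illustrates the shape of such a job: the project, subscription, and bucket names are placeholders, and the transforms may differ from the book's actual code. It reads messages from a Pub/Sub subscription, applies a fixed window (required before grouping or writing an unbounded stream to files), counts elements per window, and writes the results to GCS. Dropping the two counting transforms gives the non-aggregated variant:

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# streaming=True marks this as an unbounded pipeline. The runner is
# picked up from the command line, e.g. --runner=DataflowRunner
# --project=... --region=... --temp_location=gs://... for Dataflow.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Placeholder subscription path; replace with your own.
        | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
            subscription="projects/your-project/subscriptions/your-subscription")
        | "Decode" >> beam.Map(lambda message: message.decode("utf-8"))
        # Fixed one-minute windows so the unbounded stream can be
        # grouped and flushed to files.
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        # Aggregation step: count occurrences of each element per window.
        # Remove these two transforms for the no-aggregation variant.
        | "Count per window" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        # Placeholder bucket path; fileio.WriteToFiles supports
        # windowed, unbounded input.
        | "Write to GCS" >> fileio.WriteToFiles(
            path="gs://your-bucket/beam-output/",
            sink=lambda dest: fileio.TextSink(),
            shards=1)
    )
```

Run locally first with the Direct Runner to validate the logic, then submit the same file to Dataflow by switching the runner flag; that portability across runners is exactly the benefit described above.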