Exercise – using Dataflow to stream data from Pub/Sub to GCS
In this exercise, we will learn how to develop Beam code in Python to create data pipelines. Learning Beam can be challenging at first because you need to get used to its specific coding pattern, so we will start with some HelloWorld code. The benefit of Beam is that it is a general framework: you can create batch and streaming pipelines with similar code, and you can run that code on different runners. In this exercise, we will use the Direct Runner and Dataflow. As a summary, here are the steps:
- Creating a HelloWorld application using Apache Beam.
- Creating a Dataflow streaming job without aggregation.
- Creating a Dataflow streaming job with aggregation.
To get started, check out the code for this exercise: