Building our batch data pipeline
In this section, we will create our Spark job, run it on Google Cloud Dataproc, and use Cloud Composer to orchestrate it. In practice, this means Cloud Composer can run our job automatically every day: on each run, it creates a Dataproc cluster, executes our Spark job, and then deletes the cluster once the job completes. Tearing down clusters between runs is a standard cost-saving practice, because you should not keep computing resources running when you are not using them. The architecture of our pipeline on Google Cloud is shown in Figure 6.5:
Figure 6.5: Batch data pipeline architecture
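The create-run-delete pattern described above maps naturally onto an Airflow DAG, which is what Cloud Composer runs under the hood. The following is only a minimal sketch of such a DAG using Airflow's Dataproc operators; the project ID, region, bucket, file names, and cluster sizing are illustrative placeholders, not values from this chapter.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

# Placeholder values -- replace with your own project, region, and bucket.
PROJECT_ID = "my-gcp-project"
REGION = "us-central1"
CLUSTER_NAME = "ephemeral-spark-cluster"

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
}

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/spark_job.py"},
}

with DAG(
    dag_id="batch_data_pipeline",
    schedule_interval="@daily",   # run the pipeline once a day
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    # Create a short-lived Dataproc cluster for this run only.
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
    )

    # Submit the PySpark job to the cluster we just created.
    run_spark_job = DataprocSubmitJobOperator(
        task_id="run_spark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )

    # Delete the cluster even if the job fails, so nothing is left running.
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,
    )

    create_cluster >> run_spark_job >> delete_cluster
```

The trigger rule on the delete task ensures the cluster is removed even when the Spark job fails, so a cluster is never accidentally left running.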
Let’s begin by setting up Cloud Composer.
Cloud Composer
In this section, we will set up Cloud Composer to schedule and run our batch data processing pipeline.
Cloud Composer environment
Everything we do in Cloud Composer happens within a Cloud Composer environment. To set up our Cloud Composer environment...