Exercise: Creating and running jobs on a Dataproc cluster
In this exercise, we will try two different methods of submitting a Dataproc job. In the previous exercise, we used the Spark shell to run our Spark code interactively, which is common for practice but not common in real development. Usually, the Spark shell is used only for initial checks or quick experiments. This time, we will write Spark code in an editor and submit it to the cluster as a job.
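As a brief illustration of the difference, instead of typing statements into the Spark shell, we save a standalone script that builds its own SparkSession and hand that script to Dataproc. The following is only a minimal sketch; the file name, app name, and print statement are made up for illustration and are not the exercise's actual code:

```python
# pyspark_job.py -- a hypothetical standalone PySpark job (illustrative names only)
from pyspark.sql import SparkSession

# Unlike the Spark shell, a submitted job must create its own SparkSession.
spark = SparkSession.builder.appName("example_dataproc_job").getOrCreate()

# Placeholder logic; the real exercises read log data and write it back out.
print("Spark version:", spark.version)

spark.stop()
```

A script like this would typically be submitted with the gcloud CLI, for example `gcloud dataproc jobs submit pyspark pyspark_job.py --cluster=<your-cluster> --region=<your-region>`, where the cluster name and region are whatever you used when creating your Dataproc cluster.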
Here are the scenarios that we want to try:
- Preparing log data in GCS and HDFS
- Developing Spark ETL from HDFS to HDFS
- Developing Spark ETL from GCS to GCS
- Developing Spark ETL from GCS to BigQuery
Let's look at each of these scenarios in detail.
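Before walking through them, note that the three ETL scenarios share the same read-transform-write shape; mainly the source and sink change between hdfs:// paths, gs:// paths, and a BigQuery table. The sketch below shows that shape for the GCS-to-GCS case only; the bucket, paths, and filter logic are assumptions for illustration, not the exercise's actual code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs_to_gcs_etl_sketch").getOrCreate()

# Read raw log lines from GCS (bucket and path are hypothetical).
logs_df = spark.read.text("gs://your-example-bucket/raw/logs/")

# A trivial example transformation: drop empty lines.
cleaned_df = logs_df.filter(logs_df.value != "")

# Write the result back to GCS. Swapping the URIs to hdfs:// paths gives the
# HDFS-to-HDFS variant; the BigQuery scenario writes through the
# spark-bigquery connector instead of to a path.
cleaned_df.write.mode("overwrite").text("gs://your-example-bucket/cleaned/logs/")

spark.stop()
```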
Preparing log data in GCS and HDFS
The log data is in our GitHub repository, located here:
If you haven't cloned the repository...