Running the application on GCP Dataproc
This section is a tutorial on running the Apex application on a real Hadoop cluster in the cloud. Dataproc (https://cloud.google.com/dataproc/) is one of several options; Amazon EMR is another, and the instructions here can easily be adapted to it.
The general instructions for working on a cluster were covered in Chapter 2, Getting Started with Application Development, where a Docker container was used. This section focuses on what is different when adding Apex to an existing multi-node cluster.
To start, we head over to the GCP console (https://console.cloud.google.com/dataproc/clusters) to create a new cluster.
For better illustration, we will use the UI, but these steps can also be fully automated using the REST API or the command line:
- The first step is to decide what size of cluster and what type of machines we want. For this example, 3 worker nodes of a small machine type will suffice (for...