Exercise – Building a data lake on a Dataproc cluster
In this exercise, we will use Dataproc to store and process log data. Log data is a good example of unstructured data, and organizations often need to analyze it to understand their users' behavior.
In this exercise, we will learn how to work with HDFS and PySpark in several different ways. To begin, we will use Cloud Shell to get a basic understanding of the technologies. In the later sections, we will use Cloud Shell Editor and submit jobs to Dataproc. But as the first step, let's create our Dataproc cluster.
Creating a Dataproc cluster on GCP
To create a Dataproc cluster, open the navigation menu and find Dataproc. If this is the first time you're accessing this page, click the Enable API button. After that, you will find the CREATE CLUSTER button. There are two options, Cluster on Compute Engine and Cluster on GKE. Choose Cluster on Compute Engine, which leads to this Create...
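If you prefer the command line, the same cluster can be created from Cloud Shell with the gcloud CLI. The sketch below is a minimal example; the cluster name, region, zone, and machine sizes are assumptions for illustration, so adjust them to your own project and quota.

```shell
# Minimal sketch: create a small Dataproc cluster on Compute Engine.
# Cluster name, region, and machine types here are illustrative choices.
gcloud dataproc clusters create my-example-cluster \
    --region=us-central1 \
    --zone=us-central1-a \
    --master-machine-type=n1-standard-2 \
    --worker-machine-type=n1-standard-2 \
    --num-workers=2
```

Running this is equivalent to filling in the Create a cluster form in the console; you can verify the result with `gcloud dataproc clusters list --region=us-central1`.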