Understanding the concept of the ephemeral cluster
After running the previous exercises, you may have noticed that while Spark is very useful for processing data, it has little to no dependence on Hadoop storage (HDFS). Using data directly from GCS or BigQuery is far more convenient than staging it in HDFS first.
What does this mean? It means that we can choose not to store any data in the Hadoop cluster (more specifically, in HDFS) and use the cluster only to run jobs. For cost efficiency, we can turn the cluster on only while a job is running and turn it off as soon as the job finishes. Going further, we can delete the entire Hadoop cluster after the job completes and create a new one when we submit the next job, as sketched below. This pattern is what's called an ephemeral cluster.
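To make the lifecycle concrete, here is a minimal sketch of the create-submit-delete sequence using the Dataproc Python client (google-cloud-dataproc). The project ID, region, cluster name, machine types, and job script URI are all placeholder values for illustration, not names from the earlier exercises:

```python
from google.cloud import dataproc_v1

project_id = "your-project-id"      # placeholder
region = "us-central1"              # placeholder
cluster_name = "ephemeral-cluster"  # placeholder
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Create the cluster just before the job needs it.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}
cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# 2. Submit the PySpark job; the script reads from GCS or BigQuery, not HDFS.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket/your_job.py"},  # placeholder
}
job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()

# 3. Delete the cluster as soon as the job finishes, so nothing sits idle.
cluster_client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
).result()
```

In practice, you would typically drive this same create-submit-delete sequence from a scheduler or orchestration tool rather than running a script by hand.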
An ephemeral cluster means that the cluster is not permanent; a cluster exists only while it's running jobs. There are two main advantages to using this approach:
- Highly efficient infrastructure cost: With this approach, you don't pay for the cluster while it sits idle; compute costs are incurred only for the duration of the job.