As mentioned earlier, Google Cloud Dataproc is Google's managed Spark and Hadoop service. Because it is managed and runs in the cloud, users can turn clusters off when they are not required, which can save a lot of cost. Dataproc is therefore not only simple and time-saving, but also cost-effective.
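To make the cost argument concrete, here is a minimal sketch comparing a cluster that runs all month with one that is started only for a few hours each working day. The hourly rate, the usage pattern, and the helper name `monthly_cost` are all hypothetical placeholders chosen for illustration; they are not actual GCP prices, and the sketch ignores costs that continue while a cluster is off, such as data kept in storage.

```python
# A rough cost comparison: always-on cluster vs. cluster run only when needed.
# All figures are hypothetical placeholders, not actual GCP pricing.

HOURLY_CLUSTER_COST = 2.50  # assumed cost (USD) of running the whole cluster for one hour
HOURS_PER_MONTH = 730       # approximate number of hours in a month


def monthly_cost(hours_running: float, hourly_cost: float = HOURLY_CLUSTER_COST) -> float:
    """Return the monthly compute cost of a cluster that is billed only while running."""
    if not 0 <= hours_running <= HOURS_PER_MONTH:
        raise ValueError("hours_running must be between 0 and HOURS_PER_MONTH")
    return hours_running * hourly_cost


# Scenario 1: the cluster is never turned off.
always_on = monthly_cost(HOURS_PER_MONTH)

# Scenario 2 (assumed usage): the cluster runs 4 hours a day on 22 working days.
on_demand = monthly_cost(4 * 22)

savings = always_on - on_demand

print(f"Always on: ${always_on:,.2f}")
print(f"On demand: ${on_demand:,.2f}")
print(f"Savings:   ${savings:,.2f} ({savings / always_on:.0%} of the always-on cost)")
```

Under these assumed numbers, most of the always-on bill comes from hours in which the cluster does no work, which is the saving a cluster that can be turned off makes possible.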
Just like other managed services from Google, Dataproc can be driven through the GCP APIs. We will get into the details later in this chapter. While the initial vision of Dataproc was to provide managed Hadoop and Spark, it now offers managed support for open source Apache Hive, Pig, Hadoop, and Spark, integrates with Cloud Storage and BigQuery through connectors, and is monitored by Stackdriver. Just like Hadoop, Dataproc has Master, Client, and Worker node configurations, where Master nodes manage...