Dataproc
Dataproc is GCP's managed big data service for running Hadoop and Spark clusters. Hadoop and Spark are open-source frameworks that process data for big data applications in a distributed manner. Essentially, they provide massive storage for data, along with the processing power to handle many tasks concurrently.
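To make "distributed processing" more concrete, the following is a minimal single-machine sketch of the map-and-reduce pattern that Hadoop and Spark popularized. It is not how either framework is implemented; the function names and the round-robin partitioning are assumptions chosen for illustration. The data is split into partitions, each partition is processed independently (on a real cluster, in parallel on separate workers), and the partial results are then merged.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into lists that stand in for worker nodes."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions


def map_partition(lines):
    """Map step: count the words within a single partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(lines, num_partitions=3):
    partitions = split_into_partitions(lines, num_partitions)
    # On a cluster, each partition would be handled by a different worker.
    partials = [map_partition(p) for p in partitions]
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    sample = ["spark and hadoop", "hadoop stores data", "spark processes data"]
    print(word_count(sample))
```

Because each partition is counted independently, the map step can be scaled out simply by adding workers, which is the property a cluster provides.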
If we refer to the End-to-end big data solution section of this chapter, Dataproc is also part of the processing stage. It can be compared to Dataflow; however, Dataproc requires us to provision servers, whereas Dataflow is serverless.
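As a rough illustration of what provisioning means in practice, the sketch below builds a cluster definition and submits it with the google-cloud-dataproc Python client. The project ID, region, cluster name, machine type, and worker count are all placeholder assumptions, and `build_cluster_spec` is a hypothetical helper, not part of the library. Running `create_cluster` also assumes that the client library is installed and that credentials are configured.

```python
def build_cluster_spec(project_id, cluster_name, num_workers=2,
                       machine_type="n1-standard-4"):
    """Hypothetical helper: return a dict shaped like a Dataproc cluster resource."""
    return {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            # We choose the node counts and machine sizes ourselves.
            "master_config": {
                "num_instances": 1,
                "machine_type_uri": machine_type,
            },
            "worker_config": {
                "num_instances": num_workers,
                "machine_type_uri": machine_type,
            },
        },
    }


def create_cluster(project_id, region, cluster_name):
    """Submit the spec to Dataproc and wait for the cluster to be created."""
    # Imported here so the spec builder works without the client library.
    from google.cloud import dataproc_v1

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.create_cluster(
        request={
            "project_id": project_id,
            "region": region,
            "cluster": build_cluster_spec(project_id, cluster_name),
        }
    )
    return operation.result()  # blocks until the long-running operation finishes
```

The point of the sketch is the contrast with Dataflow: here we decide how many machines of which size to run, whereas a serverless service makes those decisions for us.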
Exam Tip
Dataproc should be chosen over Dataflow if we have an existing Hadoop or Spark cluster. The skill sets of the existing team should also be taken into account. If we need to create new pipeline jobs or process streaming data, then we should select Dataflow.
As an alternative to hosting these services on-premises, Google offers Dataproc, which has many advantages – mainly cost-saving, as you...