Remember that Hadoop in its pure, non-cloud form maintains state in a distributed file system called HDFS (the Hadoop Distributed File System). HDFS lives on the same set of nodes that actually run the Hadoop jobs; for this reason, Hadoop is said not to separate compute and storage. The compute (Hadoop JARs) and the storage (HDFS data) share the same machines, and the JARs are shipped to wherever the data lives.
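To make that co-location concrete, here is a minimal sketch using Hadoop's FileSystem API (the hdfs:///data/input path is hypothetical, and the snippet assumes it runs on a machine configured as a client of the cluster). It asks the NameNode which worker hosts hold each block of a file; these are the same hosts the scheduler prefers when placing that file's map tasks.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.Arrays;

public class ShowBlockLocality {
    public static void main(String[] args) throws Exception {
        // Connects to whatever NameNode the cluster's configuration points at.
        FileSystem fs = FileSystem.get(new Configuration());

        // hdfs:///data/input is a hypothetical path used for illustration.
        for (FileStatus file : fs.listStatus(new Path("hdfs:///data/input"))) {
            // Each block reports the worker hosts that store a replica of it;
            // those hosts are also where the job's tasks will preferentially run.
            for (BlockLocation block : fs.getFileBlockLocations(file, 0, file.getLen())) {
                System.out.println(file.getPath() + " -> " + Arrays.toString(block.getHosts()));
            }
        }
    }
}
```

The hostnames printed for the data blocks are drawn from the same pool of machines that execute the compute: that overlap is exactly what "compute and storage are not separated" means.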
This was a fine pattern in the old on-premises days, but in the cloud, keeping your data in HDFS will run up an enormous bill. Why? Because in the world of elastic Hadoop clusters, such as Dataproc on GCP or Elastic MapReduce (EMR) on AWS, HDFS lives on the persistent disks of the cloud VMs in the cluster. If you keep data in HDFS, those disks must always exist, and therefore the cluster must always be up. You will pay a lot, use only a little, and...
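A back-of-the-envelope calculation with hypothetical numbers shows the scale of the waste: a 10-worker cluster at $0.50 per node-hour costs roughly 10 x $0.50 x 730 hours, or about $3,650 a month, if it can never be shut down. If the only compute you actually need is a nightly two-hour job, that job accounts for roughly 10 x $0.50 x 60 hours, or about $300 of it. The remaining 90-plus percent of the bill pays for idle VMs whose only job is keeping HDFS disks attached.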