Hadoop is a venerable technology now, the grand old man of distributed computing. We won't spend much time on Hadoop's internals, but a brief introduction is needed for this chapter to make sense to readers who don't come from a big-data background.
The MapReduce programming paradigm is what really matters to a user. The user defines map and reduce tasks using the MapReduce API and submits them to the MapReduce part of the Hadoop ecosystem.
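To make the map and reduce roles concrete, here is a minimal in-memory sketch of a word count in plain Python. It does not use the Hadoop MapReduce API; the function names (`map_task`, `shuffle`, `reduce_task`, `run_job`) are hypothetical and chosen only for illustration. On a real cluster, the framework performs the grouping step and distributes the tasks across nodes.

```python
from collections import defaultdict


def map_task(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a list of input lines."""
    mapped = [pair for line in lines for pair in map_task(line)]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
    # prints {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

The user writes only the map and reduce logic. Everything in `shuffle` and `run_job` stands in for work that the framework would otherwise handle.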
When a job is triggered on the cluster, YARN comes into play. Its work involves prioritizing among different jobs and sharing out resources such as compute capacity.
YARN stands for Yet Another Resource Negotiator. It plays the role of scheduler and resource allocator on the Hadoop cluster, and it figures out where and how to run the job.
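The idea of prioritizing jobs and handing out limited capacity can be sketched with a toy scheduler. This is not YARN's actual algorithm or API: real YARN uses pluggable schedulers such as the Capacity Scheduler and the Fair Scheduler. The `Job` class, the `schedule` function, the node names, and the "higher number means higher priority" convention are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int    # assumption: a higher number is served first
    containers: int  # number of container slots the job requests


def schedule(jobs, node_capacity):
    """Place jobs, highest priority first, onto nodes with free slots.

    Returns (placements, waiting): placements maps each placed job name to
    a {node: slots} dict; waiting lists jobs that did not fit.
    """
    free = dict(node_capacity)  # copy so the caller's dict is not modified
    placements, waiting = {}, []
    for job in sorted(jobs, key=lambda j: -j.priority):
        if job.containers > sum(free.values()):
            waiting.append(job.name)  # not enough capacity left for this job
            continue
        remaining, placed = job.containers, {}
        for node in free:
            take = min(free[node], remaining)
            if take:
                placed[node] = take
                free[node] -= take
                remaining -= take
            if remaining == 0:
                break
        placements[job.name] = placed
    return placements, waiting


if __name__ == "__main__":
    jobs = [Job("etl", 1, 3), Job("report", 5, 2), Job("ml", 3, 4)]
    print(schedule(jobs, {"node-a": 3, "node-b": 2}))
```

In this sketch, a job that does not fit is skipped and a smaller, lower-priority job may run ahead of it. That is one possible policy among many, picked here only to keep the example short.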
This process also involves...