Hadoop is a venerable technology now; the grand old man of distributed computing technologies. We won't spend too much time dwelling on Hadoop's internals, but a brief introduction is required for this chapter for it to make sense to folks who are not from a big-data background:
The MapReduce programming paradigm is what really matters to a user. It defines a map and reduces tasks using the MapReduce API, and submits them to that part of the Hadoop ecosystem:
When a job gets triggered on the corresponding cluster, this brings YARN into play. This involves prioritizing among different jobs and sharing resources such as compute capacity:
YARN is the acronym for Yet Another Resource Negotiator, and it plays the role of a scheduler and resource allocator on the Hadoop cluster. YARN will figure out where and how to run the job:
This process also involves...