The Reduce task
The Reduce task is an aggregation step. If the number of Reduce tasks is not specified, the default number is one. Running a single Reduce task risks overloading the node it executes on, because all of the intermediate data converges there. Having too many Reduce tasks adds shuffle complexity and produces a proliferation of output files, which puts pressure on the NameNode. Understanding the data distribution and the partitioning function is important for deciding the optimal number of Reduce tasks.
Tip
Each Reduce task should ideally process between 1 GB and 5 GB of data.
The number of Reduce tasks can be set using the mapreduce.job.reduces parameter, or programmatically by calling the setNumReduceTasks() method on the Job object. There is also a cap on the number of Reduce tasks that can run concurrently on a single node, given by the mapreduce.tasktracker.reduce.tasks.maximum property.
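As a minimal sketch of both approaches, assuming a Hadoop 2.x client on the classpath (the reducer count of 38 here is an arbitrary placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Configuration route: equivalent to passing
        // -D mapreduce.job.reduces=38 on the command line.
        conf.setInt("mapreduce.job.reduces", 38);

        Job job = Job.getInstance(conf, "reducer-count-example");
        // Programmatic route: overrides mapreduce.job.reduces if both are set.
        job.setNumReduceTasks(38);
    }
}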
Note
The heuristic to determine the right number of reducers is as follows (a worked example appears below):
0.95 * (nodes * mapreduce.tasktracker.reduce.tasks.maximum)
Alternatively...
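As a worked instance of the 0.95 heuristic, the sketch below assumes a hypothetical cluster of 10 worker nodes, each configured to run at most 4 concurrent Reduce tasks:

public class ReducerHeuristic {
    public static void main(String[] args) {
        int nodes = 10;           // hypothetical cluster size
        int maxReducePerNode = 4; // mapreduce.tasktracker.reduce.tasks.maximum
        // The 0.95 factor lets all reducers launch in a single wave while
        // leaving headroom for task retries and speculative execution.
        int reducers = (int) (0.95 * nodes * maxReducePerNode);
        System.out.println(reducers); // prints 38
    }
}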