The Map task
The efficiency of the Map phase is determined by the characteristics of the job's inputs. We saw that having too many small files leads to a proliferation of Map tasks because of the large number of splits they generate. Another important statistic to watch is the average runtime of a Map task: too many and too few Map tasks are both detrimental to job performance. Striking a balance between these extremes is important, and much of it depends on the nature of the application and its data.
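To make the relationship between splits and Map tasks concrete, here is a minimal Java sketch of the split-size rule used by Hadoop's FileInputFormat. The max/min formula and the split.minsize/split.maxsize property names match FileInputFormat's behavior; the class name, main method, and sample sizes are illustrative assumptions only:

```java
// A minimal sketch; SplitSizeSketch and the sample sizes are illustrative,
// not part of the Hadoop API.
public class SplitSizeSketch {

    // Mirrors FileInputFormat: splitSize = max(minSize, min(maxSize, blockSize)),
    // where minSize and maxSize are taken from the
    // mapreduce.input.fileinputformat.split.minsize/.maxsize properties.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // a 128 MB HDFS block
        long minSize = 1L;                    // default split.minsize
        long maxSize = Long.MAX_VALUE;        // default split.maxsize
        long smallFile = 10L * 1024 * 1024;   // a 10 MB "small" file

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        // Splits never span files, so every small file produces at least
        // one split, and therefore at least one Map task.
        long splits = Math.max(1L, (smallFile + splitSize - 1) / splitSize);
        System.out.println("splitSize=" + splitSize
                + " bytes, splits for one 10 MB file=" + splits);
    }
}
```

A thousand such 10 MB files would schedule a thousand Map tasks, each finishing in seconds, which is exactly the proliferation described above.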
Tip
A rule of thumb, based on empirical evidence, is for a single Map task to run for about one to three minutes.
The dfs.blocksize attribute
The default block size of files in a cluster can be overridden in the cluster configuration file, hdfs-site.xml, generally present in the etc/hadoop folder of the Hadoop installation. In some cases, a Map task might take only a few seconds to process a block; in such cases, it is better to give each Map task a bigger block to process. This can be done in the following ways, with an illustrative configuration after the list:
- Increasing the...
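By way of illustration, a bigger default block size can be configured through the dfs.blocksize property in hdfs-site.xml. The 256 MB value below is only an example, and the new size affects only files written after the change:

```xml
<!-- hdfs-site.xml: 256 MB is an example value; the Hadoop 2.x default is 128 MB.
     Existing files keep the block size they were written with. -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>
```

The same property can also be overridden for a single upload, for example with hdfs dfs -D dfs.blocksize=268435456 -put, leaving the cluster-wide default untouched.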