MapReduce input
The Map step of a MapReduce job hinges on the nature of the input provided to the job. The Map step offers the greatest parallelism gains, so crafting it smartly is important for job speedup. Data is split into chunks called InputSplits, and a Map task operates on each InputSplit. Two other classes, InputFormat and RecordReader, are also significant in handling inputs to Hadoop jobs.
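To make the chunking idea concrete, the following is a minimal sketch, not Hadoop's actual API, of how an input of a given size can be divided into fixed-size logical splits, where each split would be handed to one Map task. The class and method names here are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of dividing job input into splits, one per Map task.
 *  This is a simplification; real Hadoop splits live in
 *  org.apache.hadoop.mapreduce and account for HDFS block locations. */
public class SplitDemo {
    /** A logical chunk of the input: a starting offset and a length. */
    record Split(long start, long length) {}

    /** Divide an input of totalBytes into splits of at most splitSize bytes. */
    static List<Split> computeSplits(long totalBytes, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < totalBytes; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, totalBytes - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300-byte input with 128-byte splits yields three Map tasks:
        // two full splits of 128 bytes and a final split of 44 bytes.
        System.out.println(computeSplits(300, 128));
    }
}
```

Because each split is processed independently, the number of splits bounds the parallelism of the Map phase, which is why split sizing matters for job speedup.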
The InputFormat class
The input data specification for a MapReduce Hadoop job is given via the InputFormat hierarchy of classes. The InputFormat class family has the following main functions:
- Validating the input data, for example, checking for the presence of the file in the given path.
- Splitting the input data into logical chunks (InputSplits) and assigning each split to a Map task.
- Instantiating a RecordReader object that can work on each InputSplit and produce records for the Map task.