Input/output
There is one aspect of our driver classes that we have mentioned several times without a detailed explanation: the format and structure of the data that is input to and output from MapReduce jobs.
Files, splits, and records
We have talked about files being broken into splits as part of the job startup and the data in a split being sent to the mapper implementation. However, this overlooks two aspects: how the data is stored in the file and how the individual keys and values are passed to the mapper.
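Before looking at the classes involved, it helps to see where these choices surface in a driver. The following is a minimal sketch, assuming the standard line-oriented TextInputFormat; the InputConfigSketch class name and the /input/data path are hypothetical:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputConfigSketch {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance();
        // The InputFormat governs how input is carved into splits and how
        // each split is presented to mappers as key/value records;
        // TextInputFormat handles line-oriented text files.
        job.setInputFormatClass(TextInputFormat.class);
        // Hypothetical input location; file-based formats read their
        // input paths from this setting.
        FileInputFormat.addInputPath(job, new Path("/input/data"));
    }
}

TextInputFormat is in fact the default, so the explicit call is redundant in a real job; it is shown here only to make the choice visible.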
InputFormat and RecordReader
Hadoop has the concept of an InputFormat for the first of these responsibilities. The InputFormat abstract class in the org.apache.hadoop.mapreduce package provides two methods, as shown in the following code:
public abstract class InputFormat<K, V> {
    public abstract List<InputSplit> getSplits(JobContext context)
        throws IOException, InterruptedException;

    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
        TaskAttemptContext context)
        throws IOException, InterruptedException;
}
These methods display the two responsibilities of the InputFormat class: to provide the details of how to divide the input source into the splits required for map processing, and to create a RecordReader that will generate the series of key/value pairs from a split.
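To make this division of labor concrete, here is a hedged sketch of a custom format, modeled on the widely used whole-file-as-one-record pattern rather than anything defined in this text; the WholeFileInputFormat and WholeFileRecordReader names are illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative InputFormat that emits each file as a single record.
// FileInputFormat already supplies getSplits(), so only the
// RecordReader side needs to be written here.
public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one split per file, so a record never straddles splits
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // RecordReader that produces exactly one key/value pair per split.
    private static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false; // the single record has already been emitted
            }
            // Read the whole file backing this split into one value.
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() {
            return NullWritable.get();
        }

        @Override
        public BytesWritable getCurrentValue() {
            return value;
        }

        @Override
        public float getProgress() {
            return processed ? 1.0f : 0.0f;
        }

        @Override
        public void close() {
            // nothing held open between calls
        }
    }
}

Overriding isSplitable to return false ensures each file arrives as exactly one split, which is what lets the RecordReader safely emit the entire file contents as a single value.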