Time for action – handling dirty data by using skip mode
Let's see skip mode in action by writing a MapReduce job that is fed data that causes it to fail:
Save the following Ruby script as gendata.rb:
File.open("skipdata.txt", "w") do |file|
  3.times do
    500000.times{file.write("A valid record\n")}
    5.times{file.write("skiptext\n")}
  end
  500000.times{file.write("A valid record\n")}
end
Run the script:
$ ruby gendata.rb
Check the size of the generated file and its number of lines:
$ ls -lh skipdata.txt
-rw-rw-r-- 1 hadoop hadoop 29M 2011-12-17 01:53 skipdata.txt
$ cat skipdata.txt | wc -l
2000015
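The line count reported above follows directly from the loops in gendata.rb: three blocks of 500,000 valid records plus 5 bad ones, then a final 500,000 valid records. As a quick sanity check, the following sketch recomputes that figure and the line numbers at which each cluster of bad records should start. The constant names are ours, introduced only for illustration, and the final scan runs only if skipdata.txt is present in the current directory.

```ruby
# Layout of skipdata.txt as implied by the loops in gendata.rb.
VALID_PER_BLOCK = 500_000
BAD_PER_BLOCK   = 5
BLOCKS          = 3

# Three (valid + bad) blocks followed by one trailing run of valid records.
EXPECTED_LINES = BLOCKS * (VALID_PER_BLOCK + BAD_PER_BLOCK) + VALID_PER_BLOCK
EXPECTED_BAD   = BLOCKS * BAD_PER_BLOCK

# 1-based line number of the first "skiptext" record in each block.
FIRST_BAD_LINES = (0...BLOCKS).map do |block|
  block * (VALID_PER_BLOCK + BAD_PER_BLOCK) + VALID_PER_BLOCK + 1
end

puts "expected lines:       #{EXPECTED_LINES}"
puts "expected bad records: #{EXPECTED_BAD}"
puts "first bad line/block: #{FIRST_BAD_LINES.inspect}"

# Optional: count the bad records actually present in the generated file.
if File.exist?("skipdata.txt")
  actual_bad = File.foreach("skipdata.txt").count { |line| line.chomp == "skiptext" }
  puts "bad records found:    #{actual_bad}"
end
```

If the counts disagree with the generated file, rerun gendata.rb before copying anything to HDFS.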
Copy the file onto HDFS:
$ hadoop fs -put skipdata.txt skipdata.txt
Add the following property definition to mapred-site.xml:
<property>
  <name>mapred.skip.map.max.skip.records</name>
  <value>5</value>
</property>
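To get a feel for what such a small threshold costs, recall that the Hadoop documentation describes skip mode as narrowing the range around a bad record with a binary search-like approach, retrying the task until the range is no larger than this value or the task runs out of attempts. The following is only a rough model of that idea, not Hadoop's actual bookkeeping: it assumes a single cluster of bad records, assumes the whole 2,000,015-line file is handed to one map task, and simply halves the suspect range after each failed attempt. The method name is hypothetical.

```ruby
# Rough model only: count how many halvings are needed before the suspect
# range is no larger than the skip threshold. Hadoop's real logic differs.
def attempts_to_isolate(range_size, max_skip_records)
  attempts = 0
  while range_size > max_skip_records
    range_size = (range_size + 1) / 2   # halve, rounding up
    attempts += 1
  end
  attempts
end

RECORDS_IN_TASK  = 2_000_015  # assumption: one map task sees the whole file
MAX_SKIP_RECORDS = 5          # value set in mapred-site.xml above

puts attempts_to_isolate(RECORDS_IN_TASK, MAX_SKIP_RECORDS)
```

Under these assumptions the model needs well over the handful of attempts a task is normally allowed, which suggests why the limit on task attempts matters when the skip threshold is small.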
Check the value set for mapred.max.map.task.failures and set it to 20 if it is lower.
Save the following Java file as SkipData.java:
import java...