Errors and recovery
Generally, the first question to ask about your application is: is it critical that all the data is received and processed? If not, then on failure you might simply restart the application and discard the missing or lost data. If it is critical, you will need to use checkpointing, which will be described in the next section.
It is also worth noting that your application's error management should be robust and self-sufficient. What we mean by this is that if an exception is non-critical, you should handle it, perhaps log it, and continue processing. Critical failures, by contrast, stop the job: for instance, when a task reaches the maximum number of allowed failures (specified by spark.task.maxFailures), Spark terminates processing.
Note
This property, among others, can be set when creating the SparkContext object, or as an additional command-line parameter when invoking spark-shell or spark-submit.
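As a minimal sketch, the property can be set programmatically on a SparkConf before the SparkContext is created; the application name and the value 8 here are illustrative placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Raise the per-task failure threshold before building the context.
// The default for spark.task.maxFailures is 4; 8 is just an example.
val conf = new SparkConf()
  .setAppName("resilient-stream")          // placeholder name
  .set("spark.task.maxFailures", "8")

val sc = new SparkContext(conf)
```

Equivalently, the same property can be passed on the command line, for example `spark-submit --conf spark.task.maxFailures=8 ...`, which avoids recompiling the application when tuning it.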
Checkpointing
In batch processing, we are used to having fault tolerance. This means, in case...