Checkpointing
Real-time streaming applications are expected to run for extended periods while remaining resilient to failure, so Spark Streaming implements a mechanism called checkpointing. This mechanism saves enough information to recover from failures. There are two types of checkpointing:
- Metadata checkpointing
- Data checkpointing
Checkpointing is enabled by calling checkpoint() on the StreamingContext:
def checkpoint(directory: String)
This specifies the directory where the checkpoint data is stored. Note that the directory must be on a fault-tolerant filesystem, such as HDFS.
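To make the recovery side concrete, the following is a minimal sketch of how a driver program might set the checkpoint directory and restore itself from it after a restart. It uses StreamingContext.getOrCreate, which rebuilds the context from checkpoint data when the directory already contains some and otherwise calls the supplied factory function. The directory path, application name, and socket source (localhost, port 9999) are placeholder assumptions for illustration only, and the program is assumed to be launched through spark-submit so that the master is supplied externally.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  // Hypothetical location; any path on a fault-tolerant filesystem would do.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoints"

  // Builds a fresh context. Only invoked when no checkpoint data exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Tell Spark Streaming where to write checkpoint data.
    ssc.checkpoint(checkpointDir)

    // The whole DStream graph must be defined inside this function so that
    // it can be restored from the checkpoint. The socket source is a
    // placeholder assumption, not part of the original example.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    // Restore from the checkpoint if one is present, otherwise start fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The design point worth noting is that all stream setup lives inside the factory function: when a checkpoint is found, the function is skipped entirely and the DStream graph is reconstructed from the saved metadata.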
Once the directory for the checkpoint is set, any DStream can be checkpointed into it, based on an interval. Revisiting the Twitter example, each DStream can be checkpointed every 10 seconds:
val ssc = new StreamingContext(sc, Seconds(5))
val twitterStream = TwitterUtils.createStream(ssc, None)
val wordStream = twitterStream.flatMap(x => x.getText().split(" "))
val aggStream...