Fault tolerance and reliability
Streaming jobs are designed to run continuously and failure in the job can result in loss of data, state, or both. Making streaming jobs fault tolerant becomes one of the essential goals of writing the streaming job in the first place. Any streaming job comes with some guarantees either by design or by implementing certain configuration features, which mandates how many times a message will be processed by the system:
- At most once guarantee: Records in such systems can either be processed once or not at all. These systems are least reliable as far as streaming solution is concerned.
- At least once guarantee: The system will process the record at least once and hence by design there will be no loss of messages, but then messages can be processed multiple times giving the problem of duplication. This scenario however is better than the previous case and there are use cases where duplicate data may not cause any problem or can easily be deduced.
- Exactly once guarantee...