Spark Streaming is the stream processing module within Apache Spark. It runs on the Spark cluster, which allows it to scale to a high degree. Being based on Spark, it is also highly fault tolerant: it checkpoints the data stream being processed so that failed tasks can be rerun. After an introductory section that gives a practical overview of how Apache Spark processes stream-based data, this chapter covers the following topics:
- Error recovery and checkpointing
- TCP-based stream processing
- File streams
- Kafka stream source
For each topic, we will provide a worked example in Scala and show how the stream-based architecture can be set up and tested.