Test your knowledge
Before moving on to the next chapter, test your knowledge with the following questions:
- Assume that the volume of data you receive in each micro-batch of the stream is very small (a few KB), while in your data lake you plan to maintain a minimum file size of 64-128 MB for better read performance. How should you design the pipeline, and what trade-offs should you consider? (One possible approach is sketched after these questions.)
- Assume that, owing to infrastructure failures, your EMR cluster was terminated, but your source application is still continuously sending events to Kinesis Data Streams. When you restart your EMR cluster to resume processing, how would you make sure that you do not lose any messages while processing the data with Spark? (A checkpointing sketch follows the compaction example below.)
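
For the first question, one common pattern is to let the stream land its small files as-is and run a periodic compaction job that rewrites them into files close to the target size, trading some extra storage and a second write for better read performance (the alternative, triggering micro-batches less frequently, trades latency instead). The sketch below is a minimal illustration of such a compaction job, not a definitive implementation: the bucket name, prefixes, daily partition layout, and the 128 MB target are all hypothetical, and boto3 is used only to estimate the input size.

```python
# A minimal compaction sketch, assuming the stream writes many small Parquet
# files under an S3 prefix partitioned by date. All paths are hypothetical.
import math

import boto3
from pyspark.sql import SparkSession

BUCKET = "my-data-lake"                                                # hypothetical bucket
RAW_PREFIX = "raw/events/dt=2023-01-01/"                               # hypothetical partition
COMPACTED_PATH = "s3://my-data-lake/compacted/events/dt=2023-01-01/"   # hypothetical output
TARGET_FILE_BYTES = 128 * 1024 * 1024  # upper end of the 64-128 MB range

# Sum the sizes of the small files so we know how many ~128 MB files to produce.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
total_bytes = sum(
    obj["Size"]
    for page in paginator.paginate(Bucket=BUCKET, Prefix=RAW_PREFIX)
    for obj in page.get("Contents", [])
)
num_files = max(1, math.ceil(total_bytes / TARGET_FILE_BYTES))

spark = SparkSession.builder.appName("small-file-compaction").getOrCreate()

# Read all the small files for the partition and rewrite them as a handful of
# larger files; coalesce reduces the number of output files without a full shuffle.
(
    spark.read.parquet(f"s3://{BUCKET}/{RAW_PREFIX}")
    .coalesce(num_files)
    .write.mode("overwrite")
    .parquet(COMPACTED_PATH)
)
```

The main trade-off to weigh is latency versus file size: larger trigger intervals or a compaction step both delay data availability in its final, read-optimized form, while writing tiny files directly keeps latency low but degrades downstream query performance.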
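For the second question, the key ideas are that Kinesis retains records for its configured retention period even while no consumer is running, and that Spark Structured Streaming can resume from its last committed position if its checkpoint lives on durable storage outside the cluster. The sketch below illustrates this under stated assumptions: a Kinesis source connector for Structured Streaming is available on the cluster (the source name and option names vary by connector and are assumptions here), and the stream name, region, and S3 paths are hypothetical.

```python
# A minimal resume-after-failure sketch, assuming a Kinesis connector for
# Structured Streaming is on the classpath. Option names below follow common
# connectors and may differ in the one you use.
from pyspark.sql import SparkSession

STREAM_NAME = "clickstream-events"                               # hypothetical stream
CHECKPOINT_PATH = "s3://my-data-lake/checkpoints/clickstream/"   # durable, outside the cluster
OUTPUT_PATH = "s3://my-data-lake/raw/events/"                    # hypothetical sink

spark = SparkSession.builder.appName("kinesis-resume").getOrCreate()

events = (
    spark.readStream.format("kinesis")          # connector-specific source name (assumption)
    .option("streamName", STREAM_NAME)          # option names vary by connector (assumption)
    .option("region", "us-east-1")              # hypothetical region
    # The starting position only matters on the very first run; on restart the
    # checkpoint's committed sequence numbers take precedence.
    .option("startingPosition", "TRIM_HORIZON")
    .load()
)

query = (
    events.writeStream.format("parquet")
    .option("path", OUTPUT_PATH)
    # The checkpoint must live on durable storage such as S3, not on the
    # cluster's local disks, so a newly launched EMR cluster can pick up
    # exactly where the terminated one left off.
    .option("checkpointLocation", CHECKPOINT_PATH)
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```

Two operational points complete the picture: the stream's retention period must be long enough to cover the cluster downtime, and the restarted job must reuse the same checkpoint location so that no sequence numbers are skipped or reprocessed inconsistently.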