Summary
In this chapter, we took a deep dive into a real-time streaming use case and integrated the data pipeline with Amazon S3, Amazon EMR, AWS Glue, and Amazon Athena.
We covered detailed implementation steps, which you can follow as they are or customize for your own use case. For our implementation, we used the Kinesis Data Generator UI tool to simulate clickstream data and push it to Kinesis Data Streams. In a production implementation, your web application should push data to Kinesis Data Streams in real time.
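As a rough illustration of a web application pushing data to Kinesis Data Streams itself, the following minimal sketch sends a single clickstream event with the boto3 `put_record` call. The event fields, the helper names `build_click_event` and `send_click_event`, and the stream name in the usage comment are hypothetical and not taken from this chapter's implementation; the sketch also assumes that AWS credentials and a region are already configured.

```python
import json
import time
import uuid


def build_click_event(user_id, page):
    """Build one clickstream event. The field names are illustrative only."""
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "page": page,
        "event_time_ms": int(time.time() * 1000),
    }


def send_click_event(kinesis_client, stream_name, event):
    """Send one event to a Kinesis data stream using the PutRecord API.

    kinesis_client is expected to be a boto3 Kinesis client, for example
    boto3.client("kinesis"). Using the user ID as the partition key keeps
    events from the same user on the same shard.
    """
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),
    )


# Example usage (requires boto3 and a stream that already exists;
# "my-clickstream" is a placeholder name):
#
#   import boto3
#   client = boto3.client("kinesis")
#   send_click_event(client, "my-clickstream", build_click_event("user-42", "/home"))
```

Passing the client in as an argument keeps the helper easy to test with a stub. For higher event volumes, batching records with `put_records` is usually a better fit than one call per event.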
Finally, we reviewed a few important parts of the EMR PySpark script, which can serve as a starting point for your own implementation.
That concludes this chapter! Hopefully, this helped you get an idea of how you can integrate real-time streaming pipelines. In the next chapter, we will walk through another use case that implements UPSERT or MERGE in a data lake using the Apache Hudi framework.