Data ingestion from streaming data sources
We explored the fundamental concepts of data ingestion from streaming data sources in the previous chapter, when we discussed AWS Glue Schema Registry (GSR). In this section, we will learn how to implement data ingestion from streaming data sources such as Amazon Kinesis and Apache Kafka using AWS Glue Spark ETL.
Stream processing is the act of continuously incorporating new data to compute a result. The input data is unbounded: it has no predetermined beginning or end, so the result is updated as records arrive rather than computed once over a complete dataset. Apache Spark has two components for stream processing: Spark Streaming and Structured Streaming.
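To make this definition concrete, here is a minimal sketch in plain Python. It is not Spark code, and the names sensor_readings and running_mean are hypothetical, introduced only for illustration. It shows a source that never ends and a result that is updated incrementally each time a new record arrives.

```python
import itertools
from typing import Iterable, Iterator


def sensor_readings() -> Iterator[float]:
    """Hypothetical unbounded source: yields readings forever."""
    for i in itertools.count():
        yield float(i % 5)


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Emit an updated mean after every new record, keeping only
    a count and a running total as state."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


# The source has no end, so we inspect only the first five results.
first_results = list(itertools.islice(running_mean(sensor_readings()), 5))
print(first_results)  # [0.0, 0.5, 1.0, 1.5, 2.0]
```

Note that running_mean never waits for the input to finish. It holds a small amount of state and emits a fresh result per record, which is the essential difference between stream processing and a batch job over a bounded dataset.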
According to the Apache Spark documentation (https://spark.apache.org/docs/3.1.1/streaming-programming-guide.html), “Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.”
Spark Streaming introduces a high-level abstraction layer called...