Differentiating batch versus real-time processing
Batch processing means processing bounded chunks of data at fixed intervals. A batch process, also called a batch load, typically handles a large volume of data in a single run and can therefore take a considerable amount of time and compute. For example, an ETL script that reads 500 GB of data from a source, transforms it, and writes it to a sink every 12 hours is a batch process.
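To make the read, transform, and write pattern concrete, here is a minimal sketch in plain Python rather than Spark. The function name `run_batch_job`, the column names, and the in-memory CSV extract are assumptions made for this illustration; a real job would read from and write to external storage and would be triggered by a scheduler.

```python
import csv
import io


def run_batch_job(source_text):
    """Read a complete CSV extract, transform every row, and return CSV output.

    The whole input exists before processing starts, which is what makes
    this a batch job.
    """
    reader = csv.DictReader(io.StringIO(source_text))
    sink = io.StringIO()
    writer = csv.DictWriter(
        sink, fieldnames=["order_id", "amount_usd"], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Transform step: convert the amount from cents to dollars.
        writer.writerow(
            {
                "order_id": row["order_id"],
                "amount_usd": int(row["amount_cents"]) / 100,
            }
        )
    return sink.getvalue()


# Hypothetical extract standing in for the data produced since the last run.
SOURCE = "order_id,amount_cents\n1,1250\n2,399\n"
result = run_batch_job(SOURCE)
```

The job produces nothing until it has been invoked on a finished extract, so the freshness of the output is bounded by how often the job runs.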
In contrast, a real-time process performs computation on a continuous stream of data. In other words, it processes each piece of data as soon as it arrives. In Spark, the Structured Streaming API is used to process data in real time.
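The idea of processing data as it arrives can be sketched in plain Python with a generator. This is not the Structured Streaming API; it only illustrates the processing model. The names `running_totals` and `sensor_events` and the sample events are hypothetical.

```python
def running_totals(events):
    """Yield an updated total for a key each time an event for it arrives."""
    totals = {}
    for key, value in events:
        totals[key] = totals.get(key, 0) + value
        # Emit a result immediately instead of waiting for the input to end.
        yield key, totals[key]


def sensor_events():
    """Hypothetical source; a real stream would be unbounded."""
    yield ("a", 1)
    yield ("b", 5)
    yield ("a", 2)
    yield ("b", 1)


# Collect the updates only so they can be inspected; a real consumer
# would act on each update as it is produced.
updates = list(running_totals(sensor_events()))
```

Each incoming event produces an updated result right away, and the running state (`totals`) is carried from one event to the next. A batch job, by comparison, starts from a complete input on every run.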
The following table illustrates the differences between batch and real-time processing in Databricks.
We will start our learning journey with batch processing and then proceed to real-time streaming.