Handling Data Spill
A data spill happens when a compute engine (such as a SQL engine or Spark) cannot hold the data a query needs in memory and has to write part of it to disk. Spilling slows the query down because reading from and writing to physical storage devices is typically much slower than accessing data in memory.
Data spills can happen when the data partitions are too large, when the compute resources (especially memory) are too small, or when the data grows during operations such as merges and unions until it exceeds the memory limit of the compute node.
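The relationship between partition size, memory, and data growth can be sketched as a rough back-of-the-envelope check. The Python snippet below is only an illustration, not how any engine actually decides to spill: the function names, the 128 MiB target partition size, and the `expansion_factor` parameter (how much a partition grows during a merge or union) are all assumptions made for this example.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def partitions_needed(total_bytes: int, target_partition_bytes: int = 128 * MIB) -> int:
    """Estimate how many partitions keep each one near the target size.

    The 128 MiB default is an assumed rule of thumb, not an engine setting.
    """
    return max(1, math.ceil(total_bytes / target_partition_bytes))


def will_likely_spill(partition_bytes: int, memory_per_task_bytes: int,
                      expansion_factor: float = 1.0) -> bool:
    """Return True if a partition, after growing by expansion_factor,
    would no longer fit in the memory available to one task.

    This is a simplified model; real engines track memory in more detail.
    """
    return partition_bytes * expansion_factor > memory_per_task_bytes


# Hypothetical workload: 50 GiB of trip data split into 200 partitions.
total = 50 * GIB
partition_size = total // 200  # 256 MiB per partition

# With 512 MiB per task, the raw partition fits in memory...
fits_raw = not will_likely_spill(partition_size, 512 * MIB)

# ...but if a merge triples the data size, it no longer fits.
spills_after_merge = will_likely_spill(partition_size, 512 * MIB, expansion_factor=3.0)

# Smaller partitions are one way to reduce the risk of spilling.
suggested = partitions_needed(total)

print(fits_raw, spills_after_merge, suggested)
```

In this sketch, each of the three causes maps to one input: oversized partitions raise `partition_bytes`, undersized compute lowers `memory_per_task_bytes`, and merges or unions raise `expansion_factor`.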
Consider the IAC scenario here. You are working on generating an annual report on data collected from their trips over the past year. The data includes information such as trip dates, origins, destinations, and trip durations. After analyzing the data, you notice that the number of trips in...