Handling data spills
Data spill refers to the process where a compute engine such as SQL or Spark, while executing a query, is unable to hold the required data in memory and writes (spills) to disk. This results in increased query execution time due to the expensive disk reads and writes. Spills can occur for any of the following reasons:
- The data partition size is too big.
- The compute resource size is small, especially the memory.
- The exploded data size during merges, unions, and so on exceeds the memory limits of the compute node.
Solutions for handling data spills would be as follows:
- Increase the compute capacity, especially the memory if possible. This will incur higher costs, but is the easiest of the options.
- Reduce the data partition sizes, and repartition if necessary. This is more effort-intensive as repartitioning takes time and effort. If you are not able to afford the higher compute resources, then reducing the data partition sizes is...