Best practices for managing performance
Managing cost and performance is a continuous activity. Sometimes the two inversely affect each other; other times, they go hand in hand. Once a workload is optimized, its pattern can change and require a different set of tweaks. That said, managed platforms such as Databricks are getting better at analyzing workloads and suggesting optimizations, or applying them directly, thereby relieving the data engineer of these responsibilities. But there is still a long way to go before complete auto-pilot. We covered many different techniques for tuning your workloads; partition pruning and I/O pruning are the main ones:
- Partition pruning: This is file-based, achieved by having a directory for each partition value. On disk, the layout looks like `<partition_key>=<partition_value>`, each directory containing a set of associated Parquet data files. Separately, if the amount of data pulled from the executors back to the driver is large, increase the `spark.driver.maxResultSize` limit. It may...
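Spark applies partition pruning automatically when a query filters on the partition column: directories whose `<key>=<value>` name does not match the filter are never read. As a rough illustration of the mechanism only, here is a minimal sketch in plain Python (not Spark); the function name, directory layout, and data are made up for the example:

```python
import os
import tempfile

def read_with_partition_pruning(base_dir, key, value):
    """Read only the files under the matching <key>=<value> directory.

    All other partition directories are skipped entirely, which is the
    essence of partition pruning: the filter is resolved against the
    directory names, so non-matching data files are never opened.
    """
    target = os.path.join(base_dir, f"{key}={value}")
    rows = []
    if os.path.isdir(target):  # non-matching partitions are never touched
        for name in sorted(os.listdir(target)):
            with open(os.path.join(target, name)) as f:
                rows.extend(line.strip() for line in f)
    return rows

# Build a toy partitioned layout: country=US/ and country=DE/,
# each holding one data file (standing in for Parquet files).
base = tempfile.mkdtemp()
for country, data in [("US", ["alice", "bob"]), ("DE", ["carla"])]:
    part = os.path.join(base, f"country={country}")
    os.makedirs(part)
    with open(os.path.join(part, "part-0000.txt"), "w") as f:
        f.write("\n".join(data))

print(read_with_partition_pruning(base, "country", "US"))  # ['alice', 'bob']
```

In real Spark, the same effect comes from writing with `df.write.partitionBy("country")` and filtering with `WHERE country = 'US'`; the planner prunes the directories before any file I/O happens.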