Optimizing queries using DFP
Dynamic file pruning (DFP) is a Delta Lake feature on Azure Databricks that automatically skips data files that are not relevant to a query. It is enabled by default and relies on the file-level statistics that Delta Lake collects, so there is no need to state explicitly which files a query should skip; performance improves because pruning happens at the granularity of individual files.
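As an illustration, consider a star-schema join in which a selective filter on a small dimension table lets DFP skip files of a large fact table at runtime. This is a hypothetical sketch; the table and column names below are not from the original text:

```sql
-- Hypothetical star-schema join. At runtime, the filter on dim_date is
-- propagated to the scan of fact_sales, so data files whose min/max
-- statistics cannot match any surviving date_key value are skipped.
SELECT f.store_id,
       SUM(f.amount) AS total_amount
FROM fact_sales AS f
JOIN dim_date AS d
  ON f.date_key = d.date_key
WHERE d.calendar_year = 2023
GROUP BY f.store_id;
```

Because the qualifying `date_key` values are only known while the join executes, this pruning cannot be expressed as a static partition filter; DFP applies it dynamically on the probe side of the join.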
Whether DFP is enabled, the minimum size of a table, and the minimum number of files needed to trigger it can be managed with the following options:
- spark.databricks.optimizer.dynamicFilePruning (default is true): Whether DFP is enabled.
- spark.databricks.optimizer.deltaTableSizeThreshold (default is 10 GB): The minimum size of the Delta table that activates DFP.
- spark.databricks.optimizer.deltaTableFilesThreshold (default is 1000): The minimum number of files of the Delta table on the probe side of the join required to trigger DFP. If the...
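A sketch of how these options could be adjusted in a Spark SQL session; the values shown are illustrative, not recommendations (note that the flag governing DFP itself is `spark.databricks.optimizer.dynamicFilePruning` in current Databricks releases):

```sql
-- Illustrative settings; the defaults are true, 10 GB, and 1000 respectively.
SET spark.databricks.optimizer.dynamicFilePruning = true;
SET spark.databricks.optimizer.deltaTableSizeThreshold = 1g;    -- let smaller Delta tables qualify for DFP
SET spark.databricks.optimizer.deltaTableFilesThreshold = 100;  -- trigger DFP with fewer files on the probe side
```

Lowering the thresholds makes DFP fire on smaller tables, at the cost of planning overhead on queries that would gain little from file skipping.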