Configurable performance-boosting features of Delta Lake
Delta Lake in Databricks has features that allow you to accelerate query performance further based on your knowledge of specific query patterns. Let’s learn about them here.
Z-ordering
Automatic stats collection is a great performance accelerator. However, it is effective only when the minimum-maximum (min-max) range of each query filter column is narrow within each data file and overlaps as little as possible across data files. What does this mean?
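A small simulation can make the idea concrete. The Python sketch below is not Delta Lake code. It only mimics file-level min-max statistics and counts how many files a selective filter would have to read under two layouts. The row count, file size, numeric ids standing in for tail numbers, and the helper names (file_stats, files_to_scan, split) are all assumptions chosen for illustration, and plain sorting stands in for clustering the data on the filter column.

```python
import random


def split(values, size):
    """Chunk a list of column values into 'data files' of `size` rows."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def file_stats(files):
    """Return the (min, max) of the filter column for each data file."""
    return [(min(f), max(f)) for f in files]


def files_to_scan(stats, value):
    """Count files whose min-max range could contain `value`."""
    return sum(1 for lo, hi in stats if lo <= value <= hi)


random.seed(0)

# Hypothetical data: 100,000 rows whose filter column takes 13,150
# distinct values (integer ids standing in for tail numbers).
rows = [random.randrange(13150) for _ in range(100_000)]
rows_per_file = 1000

# Layout A: rows are written in arrival order, so every file ends up
# with values drawn from almost the whole id range (wide min-max).
unclustered = file_stats(split(rows, rows_per_file))

# Layout B: rows are sorted on the filter column before being written,
# so each file covers a narrow range that barely overlaps its neighbours.
clustered = file_stats(split(sorted(rows), rows_per_file))

target = 4242  # arbitrary value used as the selective filter
scanned_unclustered = files_to_scan(unclustered, target)
scanned_clustered = files_to_scan(clustered, target)

print("files scanned, arrival order:", scanned_unclustered)
print("files scanned, clustered:    ", scanned_clustered)
```

With wide, overlapping ranges, the statistics cannot rule out nearly any file, whereas the clustered layout leaves only a file or two whose range can contain the target value.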
Consider a high-cardinality column such as the TailNum column in our flights table, which has a cardinality of 13150. The tail number is like a registration number for airplanes. Consider a short-haul flight that does many round trips a day. This means that the tail number of this flight will be present across a lot of time bands and hence across a lot of data files. So, if we try to query the flights table with a selective filter on TailNum, it will not be able to effectively...