You're reading from Business Intelligence with Databricks SQL: Concepts, tools, and techniques for scaling business intelligence on the data lakehouse

Product type Paperback

Published in Sep 2022

Publisher Packt

ISBN-13 9781803235332

Length 348 pages

Edition 1st Edition

Languages

SQL

Concepts

Business Intelligence

Author (1):

Vihag Gupta

Configurable performance-boosting features of Delta Lake

Delta Lake in Databricks has features that allow you to accelerate query performance further based on your knowledge of specific query patterns. Let’s learn about them here.

Z-ordering

Automatic stats collection is a great performance accelerator. However, it is effective only when the minimum-maximum (min-max) ranges of the query filter column(s) in each data file are narrow and overlap as little as possible across data files. What does this mean?

Consider a high-cardinality column such as the TailNum column in our flights table, which has a cardinality of 13,150. The tail number is like a registration number for an airplane. Consider a short-haul airplane that makes many round trips a day. Its tail number will be present across a lot of time bands and hence across a lot of data files. So, if we try to query the flights table with a selective filter on TailNum, it will not be able to effectively...
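The effect of min-max ranges on file skipping can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Delta Lake's implementation: integers stand in for tail numbers, `NUM_TAILS`, `FLIGHTS_PER_TAIL`, and `ROWS_PER_FILE` are hypothetical sizes chosen for readability, and a plain sort on one column stands in for the clustering that Z-ordering performs.

```python
import random

def chunk(rows, rows_per_file):
    """Split rows into consecutive 'data files' of a fixed size."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]

def file_stats(files):
    """Collect the (min, max) of the filter column for each data file."""
    return [(min(f), max(f)) for f in files]

def files_to_scan(stats, value):
    """Count files whose min-max range could contain the value."""
    return sum(1 for lo, hi in stats if lo <= value <= hi)

# Hypothetical sizes (assumptions, not taken from the flights table).
NUM_TAILS = 1000          # distinct tail numbers, represented as integers
FLIGHTS_PER_TAIL = 20     # rows per tail number
ROWS_PER_FILE = 500       # rows written to each data file

random.seed(0)
rows = [tail for tail in range(NUM_TAILS) for _ in range(FLIGHTS_PER_TAIL)]
random.shuffle(rows)      # each tail number is spread across arrival order

# Files written in arrival order: every file holds a wide range of tail numbers.
unclustered = chunk(rows, ROWS_PER_FILE)
# Files written after sorting on the filter column: narrow, non-overlapping ranges.
clustered = chunk(sorted(rows), ROWS_PER_FILE)

target = 421              # selective filter: TailNum = 421
scanned_unclustered = files_to_scan(file_stats(unclustered), target)
scanned_clustered = files_to_scan(file_stats(clustered), target)

print(f"files in table:              {len(unclustered)}")
print(f"scanned without clustering:  {scanned_unclustered}")
print(f"scanned with clustering:     {scanned_clustered}")
```

When the rows are scattered, almost every file's min-max range contains the target value, so the statistics rule out very few files. Once rows with the same value are colocated, the ranges become narrow and disjoint, and the same filter touches a single file.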