Optimizing Delta tables in a Synapse Spark pool lake database
As covered in the Processing data using Spark pools and lake databases recipe of Chapter 8, Processing Data Using Azure Synapse Analytics, a lake database allows you to store processed data in Delta tables, which are backed by Parquet files. Delta tables are well suited to storing processed data that is consumed by reporting solutions such as Power BI.
To get the best performance out of Delta tables, it is essential to keep the data evenly distributed across the underlying Parquet files and to purge files that are no longer needed. The OPTIMIZE command compacts many small Parquet files into fewer, larger ones, while the VACUUM command removes Parquet files that are no longer referenced by the Delta transaction log from the Azure Data Lake filesystem. Both commands need to be run regularly against the lake database to keep the queries issued against its Delta tables performing well.
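As a quick illustration, the following snippet shows how the two commands can be issued from a Synapse notebook using Spark SQL. The lakedb database and sales table are placeholder names, and the retention period shown is Delta Lake's 7-day default:

```python
# Minimal sketch, assuming a Synapse notebook where the spark session is
# predefined, and a Delta table named sales in a lake database named
# lakedb (both names are hypothetical).

# Compact small Parquet files into fewer, larger, evenly sized ones.
spark.sql("OPTIMIZE lakedb.sales")

# Remove Parquet files that are no longer referenced by the Delta log
# and are older than the retention threshold (168 hours = 7 days).
spark.sql("VACUUM lakedb.sales RETAIN 168 HOURS")
```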
In this recipe, we will write a script that scans all the Delta tables in a lake database and runs the OPTIMIZE and VACUUM commands on each of them.
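A rough sketch of that approach is shown below. It assumes a Synapse notebook with the predefined spark session, uses a hypothetical lakedb database name, and relies on DESCRIBE DETAIL (supported by the Delta Lake version shipped with the Synapse Spark runtime) to skip objects that are not Delta tables:

```python
# Sketch of a maintenance script that optimizes and vacuums every
# Delta table in a lake database. All names are illustrative.

def optimize_lake_database(db_name: str, retain_hours: int = 168) -> None:
    """Run OPTIMIZE and VACUUM on every Delta table in db_name."""
    for table in spark.catalog.listTables(db_name):
        full_name = f"{db_name}.{table.name}"
        try:
            # DESCRIBE DETAIL reports the storage format of the table;
            # skip anything that is not backed by Delta.
            detail = spark.sql(f"DESCRIBE DETAIL {full_name}").first()
            if detail["format"] != "delta":
                continue
            spark.sql(f"OPTIMIZE {full_name}")
            spark.sql(f"VACUUM {full_name} RETAIN {retain_hours} HOURS")
            print(f"Optimized and vacuumed {full_name}")
        except Exception as exc:  # e.g. views or unsupported tables
            print(f"Skipped {full_name}: {exc}")

optimize_lake_database("lakedb")
```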