Choosing the proper file size
Azure Data Lake Storage Gen2 and the Spark-based compute engines (Databricks and Synapse Analytics Spark pools), as well as Data Factory, are optimized to perform better on larger files. When queries need to work with a large number of small files, the per-file overhead can quickly become a performance bottleneck.
Apart from performance, there is a cost consideration: reading a single 16 MB file is cheaper than reading four 4 MB files, because reading the first block of a file incurs more cost than reading subsequent blocks.
Optimal file sizes are between 64 MB and 1 GB per file. Hitting that range can be a challenge in the RAW zone, where you often have little control over the size of the incoming files.
Testing to find the optimal number and size of files for the compute engine you use is key here; one common remedy, sketched below, is to compact small files into larger ones as data moves out of the RAW zone.
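As a minimal sketch of that compaction step, the following PySpark snippet reads a folder of small Parquet files and rewrites them as a handful of larger files. The storage paths, the assumed input size, and the ~256 MB target are illustrative placeholders, not values prescribed by this chapter; the right numbers should come from your own testing.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Hypothetical paths -- replace with your own storage account and containers.
source_path = "abfss://raw@mydatalake.dfs.core.windows.net/sales/"
target_path = "abfss://cleansed@mydatalake.dfs.core.windows.net/sales/"

# Read the many small files produced in the RAW zone.
df = spark.read.parquet(source_path)

# Choose the number of output files so each one lands in the 64 MB - 1 GB range.
# A simple rule of thumb is total input size divided by target file size; here
# we assume roughly 2 GB of input and aim for ~256 MB per file, giving 8 files.
num_output_files = 8

# repartition() shuffles the data into evenly sized partitions, and each
# partition becomes one output file. coalesce(n) avoids the shuffle but can
# produce unevenly sized files.
(df.repartition(num_output_files)
   .write
   .mode("overwrite")
   .parquet(target_path))
```

Whether you run this on Databricks or a Synapse Spark pool, check the resulting file sizes afterward and adjust the partition count until the output stays within the target range.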
Now that we have covered the basics of setting up a data lake, let's start implementing one by provisioning an Azure storage account.