Optimizing the number of files and the size of each file
The number of files and the size of each file also affect the performance of your analytic workloads, specifically the data retrieval phase that the analytic engine carries out. To understand this relationship, let's look at how an engine generally retrieves data and returns a result.
In general, an engine retrieves data and returns a result in four steps: it gets a list of files, reads each file, processes the contents according to your query, and returns the result. When the data is in Amazon S3, the analytic engine lists the objects in the specified S3 bucket, gets the objects, reads their contents, processes them, and returns the result. When you use an AWS Glue ETL Spark job...
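The list, get, read, and process sequence described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine is implemented: the dictionary standing in for an object store, the helper names (`list_objects`, `get_object`, `run_query`), and the sample keys are all hypothetical, and no real S3 API is called. The sketch only counts how many fetches the same data requires under two file layouts.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for an object store: key -> file contents.
Store = Dict[str, bytes]


def list_objects(store: Store, prefix: str) -> List[str]:
    """Step 1: list the keys under a prefix."""
    return sorted(key for key in store if key.startswith(prefix))


def get_object(store: Store, key: str) -> bytes:
    """Step 2: fetch the bytes of one object."""
    return store[key]


def run_query(store: Store, prefix: str) -> Tuple[int, int]:
    """List, get, read, and process.

    Returns (sum of all integer lines, number of get calls made).
    """
    total = 0
    get_calls = 0
    for key in list_objects(store, prefix):
        body = get_object(store, key)
        get_calls += 1
        # Steps 3 and 4: read the contents and apply the "query" (a sum).
        for line in body.decode("utf-8").splitlines():
            total += int(line)
    return total, get_calls


# The same ten values, stored as ten small objects or as one larger object.
many_small: Store = {
    f"data/part-{i:02d}.txt": f"{i}\n".encode("utf-8") for i in range(10)
}
one_large: Store = {
    "data/part-00.txt": "".join(f"{i}\n" for i in range(10)).encode("utf-8")
}

small_result = run_query(many_small, "data/")
large_result = run_query(one_large, "data/")
print(small_result, large_result)
```

Both layouts produce the same query result, but the small-file layout needs ten fetches where the single-file layout needs one. Against a real object store, each fetch is a separate request with its own overhead, which is one reason file count and file size matter for the retrieval phase.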