How to choose the right data format
Not every tool supports every data format. Every tool reads data off disk in blocks (KB/MB/GB), so minimizing the number of these fetches improves the speed of access to data. Conversely, a read for a single record brings back far more data than you may want, so caching that block may help with subsequent queries. Different systems have different default block sizes. To choose the right data format, you need to consider several factors, such as the following:
- What is the optimal trade-off among cost, performance, and throughput for your ingestion and access patterns?
- Are you constrained by storage, memory, CPU, or I/O?
- How large is a file? If a large file is not splittable, it cannot be divided among parallel readers, and you lose the parallelism that allows fast queries.
- How many columns are being stored, and how many columns are used for the analysis?
- Does your data change over time? If it does, how often does it happen, and how does it change?...
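
The block-read and column-count questions above can be made concrete with some rough arithmetic. The following is a minimal sketch, not a model of any particular tool: it assumes fixed-width, uncompressed values and reads that start on a block boundary. The function names and the sample numbers (a 4 KiB block, 200-byte records, ten 8-byte columns) are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def blocks_fetched(bytes_needed: int, block_size: int) -> int:
    """Whole blocks needed to cover a contiguous, block-aligned byte range."""
    # Integer ceiling division: a partially used block still costs a full fetch.
    return -(-bytes_needed // block_size)


def read_amplification(record_size: int, block_size: int) -> float:
    """Bytes actually read divided by bytes wanted for a single-record lookup."""
    return blocks_fetched(record_size, block_size) * block_size / record_size


def scan_bytes(rows: int, column_sizes: Sequence[int],
               used: Sequence[int], columnar: bool) -> int:
    """Bytes touched by a full scan that only needs the columns in `used`.

    Assumption: a row-oriented layout must read every column of every row,
    while a column-oriented layout reads only the requested columns.
    """
    if columnar:
        return rows * sum(column_sizes[i] for i in used)
    return rows * sum(column_sizes)


if __name__ == "__main__":
    block = 4096  # hypothetical 4 KiB block size

    # One 200-byte record still costs a full block.
    print("blocks for one record:", blocks_fetched(200, block))
    print("read amplification:", read_amplification(200, block))

    # One million rows, ten 8-byte columns, analysis uses two of them.
    sizes = [8] * 10
    for layout, columnar in (("row-oriented", False), ("column-oriented", True)):
        total = scan_bytes(1_000_000, sizes, [0, 1], columnar)
        print(layout, "bytes:", total, "blocks:", blocks_fetched(total, block))
```

Under these assumptions, the gap between the two layouts grows with the share of columns the analysis does not use, and the cost of a single-record lookup is dominated by the block size rather than the record size. Real formats add compression, encoding, and metadata, so treat the numbers as directional only.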