Apache Parquet
As a general-purpose storage format for a pd.DataFrame, Apache Parquet is the best option. Among its advantages, Parquet offers:
- Metadata storage – this allows the format to track data types, among other features
- Partitioning – not everything needs to be in one file
- Query support – Parquet files can be queried on disk, so you don’t have to bring all data into memory (see the sketch after this list)
- Parallelization – reading data can be parallelized for higher throughput
- Compactness – data is compressed and stored in a highly efficient manner
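To illustrate partitioning and on-disk querying, here is a minimal sketch that assumes the pyarrow engine is installed; the DataFrame contents, column names, and the sales path are all hypothetical:

```python
import pandas as pd

# Hypothetical example data; the column names are purely illustrative
df = pd.DataFrame(
    {
        "region": ["north", "south", "north"],
        "year": [2023, 2023, 2024],
        "revenue": [100.5, 200.25, 150.75],
    }
)

# Partitioning: write one subdirectory per region instead of a single file
df.to_parquet("sales", partition_cols=["region"])

# Query support: read back only the columns and rows you need, so the
# full dataset never has to be loaded into memory
subset = pd.read_parquet(
    "sales",
    columns=["year", "revenue"],          # column pruning
    filters=[("region", "==", "north")],  # predicate pushdown
)
```

Because the column selection and filter are pushed down to the Parquet reader, only the matching partitions and the requested columns are read from disk.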
Unless you are working with legacy systems, Apache Parquet should replace CSV files in your workflows, whether you are persisting data locally, sharing it with team members, or exchanging it across systems.
How to do it
The API to read/write Apache Parquet is consistent with all the other pandas I/O APIs we have seen so far: for reading, there is pd.read_parquet, and for writing, the pd.DataFrame.to_parquet method.
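To make this concrete, here is a minimal round trip; the DataFrame contents are illustrative, and the pyarrow engine (the pandas default when installed) is assumed:

```python
import io

import pandas as pd

df = pd.DataFrame({"int_col": [1, 2, 3], "str_col": ["x", "y", "z"]})

# Write to an in-memory buffer; a file path like "data.parquet" works the same way
buffer = io.BytesIO()
df.to_parquet(buffer)

# Read it back; data types survive the round trip, unlike with CSV
buffer.seek(0)
roundtrip = pd.read_parquet(buffer)
print(roundtrip.dtypes)
```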