Writing Large Datasets
In this recipe, you will explore how the choice of file format can impact overall write and read performance. You will work with Parquet, Optimized Row Columnar (ORC), and Feather, and compare their performance to other popular file formats such as JSON and CSV.
The three file formats, ORC, Feather, and Parquet, are columnar file formats, which makes them efficient for analytical workloads and generally improves query performance, particularly for queries that read only a subset of columns. All three are also supported in Apache Arrow (PyArrow), which offers an in-memory columnar format for optimized data analysis performance. To persist this in-memory columnar data to disk, you can use the pandas to_orc, to_feather, and to_parquet writer functions.
Arrow provides the in-memory representation of the data in a columnar format, while Feather, ORC, and Parquet allow us to store this representation on disk.
Getting Ready
In this recipe, you will be working with the New...