Parquet - an efficient and interoperable big data format
We explored the Parquet format in Chapter 7, Spark 2.0 Concepts. To recap, Parquet is an interoperable storage format whose main goals are space efficiency and query efficiency. It was developed by Twitter and Cloudera, drawing on Google's Dremel, and it implements Dremel's nested storage format. It is now a top-level Apache project. Parquet stores data in a columnar format and has an evolvable schema. The columnar layout enables query optimization (a query reads only the columns it needs, rather than bringing every column into memory and discarding the ones not needed) and storage optimization (encoding at the column level gives a much higher compression ratio, because the values in a column share a type and tend to resemble one another). Another interesting feature is that Parquet can store nested datasets. This feature can be leveraged in curated data lakes to store subject-based data. In addition to the ability to restrict...
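The two benefits of a columnar layout mentioned in the recap, reading only the needed columns and encoding each column separately, can be illustrated with a small sketch in plain Python. This is a toy model and not Parquet's on-disk format: the sample records, the field names (`id`, `country`, `amount`), and the helper functions `to_columns` and `run_length_encode` are all hypothetical, and run-length encoding stands in here for the family of column-level encodings a columnar format can apply.

```python
from itertools import groupby

# Hypothetical sample records; field names and values are illustrative only.
rows = [
    {"id": 1, "country": "US", "amount": 10},
    {"id": 2, "country": "US", "amount": 25},
    {"id": 3, "country": "US", "amount": 7},
    {"id": 4, "country": "IN", "amount": 12},
    {"id": 5, "country": "IN", "amount": 30},
    {"id": 6, "country": "UK", "amount": 18},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Column pruning: a query that needs only `amount` reads a single list
# and never touches `id` or `country`.
total_amount = sum(columns["amount"])

# Column-level encoding: within one column, similar values sit next to
# each other, so repeated values collapse into a few runs.
encoded_country = run_length_encode(columns["country"])

print(total_amount)      # 102
print(encoded_country)   # [('US', 3), ('IN', 2), ('UK', 1)]
```

In the row-oriented list, the `country` values are interleaved with the other fields, so there are no runs to collapse. Once the values are grouped by column, six strings shrink to three pairs, which is the intuition behind the higher compression ratio of a columnar format.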