Storing data as Parquet files
Parquet (https://parquet.apache.org/) is rapidly becoming the go-to data storage format in the world of big data because of the distinct advantages it offers:
- It has a column-based representation of data. This is better represented in a picture, as follows:
As you can see in the preceding screenshot, Parquet stores data in chunks of rows, say 100 rows. In Parquet terms, these are called RowGroups. Each RowGroup contains chunks of columns (or column chunks). A column chunk can hold more than a single unit of data for a particular column (as represented by the blue box in the first column). For example, Jai, Suri, and Dhina form a single chunk even though they are three separate units of data for Name.
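To make the layout concrete, here is a minimal sketch in plain Python that regroups row-oriented records into row groups made of per-column chunks. It only models the idea and is not the real Parquet encoding. The helper name `to_row_groups`, the `Age` column, and every name other than Jai, Suri, and Dhina are hypothetical and used purely for illustration.

```python
from typing import Any, Dict, List


def to_row_groups(rows: List[Dict[str, Any]], rows_per_group: int) -> List[Dict[str, List[Any]]]:
    """Split rows into row groups; inside each group keep one list (chunk) per column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        # One column chunk per column: all values of that column for this batch of rows.
        chunks = {name: [row[name] for row in batch] for name in batch[0]}
        groups.append(chunks)
    return groups


# Hypothetical sample data; only the first three names come from the text.
rows = [
    {"Name": "Jai", "Age": 20},
    {"Name": "Suri", "Age": 21},
    {"Name": "Dhina", "Age": 22},
    {"Name": "Asha", "Age": 23},
    {"Name": "Ravi", "Age": 24},
    {"Name": "Mina", "Age": 25},
]

row_groups = to_row_groups(rows, rows_per_group=3)

# The Name column chunk of the first row group holds three units of data.
print(row_groups[0]["Name"])  # ['Jai', 'Suri', 'Dhina']
```

Here the group size is 3 rather than 100 so that the output stays readable. Each dictionary in `row_groups` stands in for one RowGroup, and each list inside it stands in for one column chunk.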
Another unique feature is that these column chunks (groups of a single column's information) can be read independently. Let's consider the following image:
We can see that the items of column data are stored next to each other in...