Using the Avro data model in Parquet
Parquet is a highly efficient columnar storage format, but it is also relatively new. Avro (https://avro.apache.org) is a widely used row-based serialization format. This recipe shows how we can keep the older, flexible Avro data model in our code while still storing the data in the Parquet format.
The parquet-mr project (yes, the one that provides the parquet-tools we saw in the previous recipe) has converters for almost all the popular data models. These converters accept objects in your data model and translate them into the Parquet format before persisting them.
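As a minimal illustration of such a converter, the parquet-avro module's `AvroParquetWriter` takes Avro records and writes them out as Parquet. This sketch uses the generic Avro API; the schema and field names are hypothetical, and it assumes parquet-avro 1.8+ (which introduced the builder API) on the classpath:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// A hypothetical Avro schema; the record and field names are illustrative.
val schemaJson =
  """{"type": "record", "name": "Student",
    | "fields": [{"name": "id", "type": "int"},
    |            {"name": "name", "type": "string"}]}""".stripMargin
val schema = new Schema.Parser().parse(schemaJson)

// AvroParquetWriter converts each Avro record to Parquet as it is written.
val writer = AvroParquetWriter
  .builder[GenericRecord](new Path("students.parquet"))
  .withSchema(schema)
  .build()

val rec = new GenericData.Record(schema)
rec.put("id", 1)
rec.put("name", "Ada")
writer.write(rec)
writer.close()
```

The same idea, driven through Hadoop input/output formats, is what lets Spark persist an RDD of Avro objects as Parquet in the steps that follow.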
How to do it…
In this recipe, we'll use the Avro data model and serialize the data in a Parquet file. The recipe involves the following steps:
- Create the Avro model.
- Generate Avro objects using the `sbt avro` plugin.
- Construct the RDD of your generated object (`StudentAvro`) from `Students.csv`.
- Save the `RDD[StudentAvro]` in a Parquet file.
- Read the file back for verification.
- Use `parquet-tools` to verify.
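The core of these steps can be sketched in Spark as follows. This is a sketch, not the recipe's exact code: the `StudentAvro` accessors, the CSV column layout, and the paths are assumptions, and it presumes an existing `SparkContext` (`sc`), the class generated by the sbt avro plugin, and the parquet-avro module on the classpath:

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.{AvroParquetOutputFormat, AvroReadSupport, AvroWriteSupport}
import org.apache.parquet.hadoop.{ParquetInputFormat, ParquetOutputFormat}

// Build an RDD[StudentAvro] from Students.csv (column layout assumed).
val students = sc.textFile("Students.csv").map { line =>
  val Array(id, name, email) = line.split(",")
  StudentAvro.newBuilder()          // builder generated from the .avsc
    .setId(id.toInt)
    .setName(name)
    .setEmail(email)
    .build()
}

// Save as Parquet: register Avro write support and the schema on the job.
val job = Job.getInstance(sc.hadoopConfiguration)
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
AvroParquetOutputFormat.setSchema(job, StudentAvro.getClassSchema)
students.map(s => (null, s)).saveAsNewAPIHadoopFile(
  "students.parquet",
  classOf[Void], classOf[StudentAvro],
  classOf[ParquetOutputFormat[StudentAvro]],
  job.getConfiguration)

// Read the file back through Avro read support for verification.
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[StudentAvro]])
val readBack = sc.newAPIHadoopFile(
  "students.parquet",
  classOf[ParquetInputFormat[StudentAvro]],
  classOf[Void], classOf[StudentAvro],
  job.getConfiguration)
readBack.map(_._2).collect().foreach(println)
```

The final verification step can then be done from the shell with, for example, `parquet-tools schema students.parquet` (path assumed), which prints the Parquet schema derived from the Avro model.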