Avro is a widely used data serialization system that provides a compact and fast binary data format. Avro files are self-describing because the schema is stored along with the data.
You can download the spark-avro connector JAR from https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11/3.2.0.
We will switch to Spark 2.1 for this section. At the time of writing, a documented bug in the spark-avro connector library causes exceptions when writing Avro files (using spark-avro connector 3.2) with Spark 2.2.
Start Spark shell with the spark-avro JAR included in the session:
Aurobindos-MacBook-Pro-2:spark-2.1.0-bin-hadoop2.7 aurobindosarkar$ bin/spark-shell --jars /Users/aurobindosarkar/Downloads/spark-avro_2.11-3.2.0.jar
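Once the shell is up, a quick round trip can confirm that the connector JAR was picked up. The following is a minimal sketch to run inside that session, not part of the reviews example itself: the two sample rows, the column names item_id and rating, and the output directory /tmp/avro_roundtrip_check are all hypothetical placeholders. It assumes the implicit avro methods that spark-avro 3.x adds to the DataFrame reader and writer.

```scala
// Run inside the spark-shell session started above, where `spark` is predefined.
import com.databricks.spark.avro._   // adds .avro(...) to DataFrameReader and DataFrameWriter
import spark.implicits._             // enables .toDF on local collections

// Hypothetical sample rows standing in for real data.
val sample = Seq(("item-1", 5.0), ("item-2", 3.0)).toDF("item_id", "rating")

// Hypothetical output directory; overwrite mode keeps the check repeatable.
val outDir = "/tmp/avro_roundtrip_check"
sample.write.mode("overwrite").avro(outDir)

// Read the files back; no schema is supplied because Avro files carry their own.
val roundTrip = spark.read.avro(outDir)
roundTrip.printSchema()
roundTrip.show()
```

If the import fails with a "not found" error, the JAR path passed to --jars is the first thing to check.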
We will use the JSON file from the previous section containing the Amazon reviews data to create...