Using Spark with Avro files
Avro is a widely used data serialization system that provides a compact and fast binary data format. Avro files are self-describing because the schema is stored along with the data.
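To make the self-describing property concrete, the following is a minimal sketch that uses the Apache Avro Java library directly from Scala. It assumes the Avro classes are on the classpath (they normally are inside a Spark shell). The `Review` schema, its field names, and the sample values are hypothetical and chosen only for illustration. The point to notice is that the reader is created without a schema and recovers it from the file header.

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}

// Hypothetical two-field schema, used only for this illustration.
val schemaJson =
  """{"type": "record", "name": "Review", "fields": [
    |  {"name": "asin", "type": "string"},
    |  {"name": "overall", "type": "double"}
    |]}""".stripMargin
val schema = new Schema.Parser().parse(schemaJson)

// Write one record to a temporary Avro container file.
val file = File.createTempFile("reviews", ".avro")
val record = new GenericData.Record(schema)
record.put("asin", "B000000000")          // made-up sample value
record.put("overall", Double.box(5.0))    // made-up sample value

val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, file)               // the schema is written into the file header
writer.append(record)
writer.close()

// Read the file back WITHOUT supplying a schema: it is taken from the header.
val reader = new DataFileReader[GenericRecord](file, new GenericDatumReader[GenericRecord]())
println(reader.getSchema.toString(true))  // pretty-prints the embedded schema
while (reader.hasNext) println(reader.next())
reader.close()
```

Because the schema travels with the data, any consumer (including the spark-avro connector) can infer column names and types from the file alone.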
You can download the spark-avro connector JAR from https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11/3.2.0.
Note
We will switch to Spark 2.1 for this section. At the time of writing this book, a documented bug in the spark-avro connector library causes exceptions when writing Avro files (using spark-avro connector 3.2) with Spark 2.2.
Start the Spark shell with the spark-avro JAR included in the session:
Aurobindos-MacBook-Pro-2:spark-2.1.0-bin-hadoop2.7 aurobindosarkar$ bin/spark-shell --jars /Users/aurobindosarkar/Downloads/spark-avro_2.11-3.2.0.jar
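Once the shell is up, a quick round trip can confirm that the connector is actually on the classpath. The sketch below is a smoke test, not part of the walkthrough itself. It assumes the `spark` session that the Spark shell creates, writes a tiny generated DataFrame to a temporary directory (the directory name is arbitrary), and reads it back through the `avro` methods that the `com.databricks.spark.avro._` import adds to the reader and writer.

```scala
import java.nio.file.Files
import com.databricks.spark.avro._   // adds .avro(...) to DataFrameReader and DataFrameWriter

// `spark` is the SparkSession provided by the Spark shell.
// Write to a fresh subdirectory of a temp directory so the target path does not exist yet.
val outDir = Files.createTempDirectory("avro-smoke-test").resolve("out").toString

// A three-row DataFrame with a single column named "id".
val df = spark.range(0, 3).toDF("id")
df.write.avro(outDir)

// Read the Avro files back; the schema comes from the files themselves.
val back = spark.read.avro(outDir)
back.printSchema()
println(back.count())                // expect 3 if the round trip works
```

If the import fails, the JAR was most likely not picked up by `--jars`; if the write fails on Spark 2.2, that would be consistent with the connector issue mentioned in the note above.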
We will use the JSON file from the previous section containing the Amazon reviews data to create the Avro file. Create a DataFrame from the input JSON file and display the number of records:
scala> import com.databricks.spark.avro...