Ingesting Avro files
Like Parquet, Apache Avro is a widely used format for storing analytical data. Avro is a leading row-based serialization format and relies on schemas, which are defined in JSON. It also supports Remote Procedure Calls (RPCs), making data transmission easier, and its schema resolution handles evolution problems such as missing fields, extra fields, and renamed fields.
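As a brief illustration, an Avro schema is a JSON document that names the record and describes each field's type. The record and field names below are hypothetical, not taken from this recipe's dataset:

```json
{
  "type": "record",
  "name": "Person",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

Because every Avro file carries its schema, a reader can reconcile it with an expected schema, which is how the evolution problems mentioned above are resolved.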
In this recipe, we will learn how to read an Avro file properly, and later we will look at how the format works.
Getting ready
This recipe requires a SparkSession with a configuration that differs from the one used in the previous Ingesting Parquet files recipe. If you already have a SparkSession running, stop it with the following command:
spark.stop()
We will create another session in the How to do it… section.
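As a sketch of what the new session might look like, the snippet below creates a SparkSession that loads the external spark-avro package. The Maven coordinates and version are assumptions for a Spark 3.x installation with Scala 2.12; match them to your own Spark version, and prefer the exact configuration shown in the How to do it… section:

```python
from pyspark.sql import SparkSession

# Start a new local session that pulls in the spark-avro package.
# The package coordinates below are an assumption (Spark 3.4.1, Scala 2.12);
# adjust the version string to match your Spark installation.
spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("ingesting_avro_files")
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.4.1")
    .getOrCreate()
)

# With the package loaded, Avro files are read via the "avro" format, e.g.:
# df = spark.read.format("avro").load("path/to/file.avro")
```

Spark does not bundle the Avro data source by default, which is why the session configuration differs from the Parquet recipe.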
The dataset used here can be found at this link: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook/tree/main/Chapter_7/ingesting_avro_files.
Feel free to execute the code in a Jupyter notebook or your PySpark...