Ingesting Parquet files
Apache Parquet is an open source, columnar storage format designed for fast data processing. It is available to any project in the Hadoop ecosystem and can be read from many programming languages.
Thanks to its efficient compression and fast read performance, it is one of the most widely used formats for analyzing data at high volume. The objective of this recipe is to understand how to read a collection of Parquet files using PySpark in a real-world scenario.
Getting ready
For this recipe, we will need a SparkSession to be initialized. You can use the code provided at the beginning of this chapter to do so.
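If you do not have that code at hand, a minimal initialization looks like the following sketch; the application name here is an arbitrary placeholder, not necessarily the one used elsewhere in the chapter:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession.
# The app name is a hypothetical placeholder for this example.
spark = (
    SparkSession.builder
    .appName("ingesting-parquet-files")
    .getOrCreate()
)
```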
The dataset for this recipe will be Yellow Taxi Trip Records from New York. You can download it by accessing the NYC Government website and selecting 2022 | January | Yellow Taxi Trip Records or using this link:
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
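If you would rather fetch the file programmatically than through the browser, the short sketch below uses Python's standard library; the local filename is an assumption chosen to mirror the remote file:

```python
import urllib.request

# URL of the January 2022 Yellow Taxi Trip Records Parquet file.
url = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "yellow_tripdata_2022-01.parquet"
)

# Save it next to the notebook; the destination name is an
# arbitrary choice for this example.
urllib.request.urlretrieve(url, "yellow_tripdata_2022-01.parquet")
```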
Feel free to execute the code in a Jupyter notebook...