Preparing the essential prerequisites
In this section, we will make sure that the following prerequisites are ready before proceeding with the hands-on solutions in this chapter:
- The Parquet file to be analyzed and processed
- The S3 bucket where the Parquet file will be uploaded
Downloading the Parquet file
In this chapter, we will work with a bookings dataset similar to the one used in previous chapters. However, this time the source data is stored in a Parquet file, and we have modified some of the rows so that the dataset contains dirty data. With that in mind, let’s download the synthetic.bookings.dirty.parquet file to our local machine.
You can find it here: https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS/raw/main/chapter05/synthetic.bookings.dirty.parquet.
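If you prefer to script the download rather than use a browser, the following is a minimal sketch that uses only the Python standard library. The helper names download_file and looks_like_parquet are hypothetical and are not part of the book's code. The second helper relies on the fact that a Parquet file begins and ends with the 4-byte marker PAR1, so it is only a quick sanity check and not a full validation of the file.

```python
import shutil
import urllib.request
from pathlib import Path

# URL taken from the text above.
PARQUET_URL = (
    "https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS"
    "/raw/main/chapter05/synthetic.bookings.dirty.parquet"
)

# Parquet files start and end with this 4-byte magic marker.
PARQUET_MAGIC = b"PAR1"


def download_file(url: str, destination: Path) -> Path:
    """Stream the content at `url` into `destination` (hypothetical helper)."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, destination.open("wb") as out:
        shutil.copyfileobj(response, out)
    return destination


def looks_like_parquet(path: Path) -> bool:
    """Cheap sanity check: compare the first and last 4 bytes with PAR1."""
    # Header marker + footer length field + trailer marker need 12 bytes.
    if path.stat().st_size < 12:
        return False
    with path.open("rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # 2 means "relative to the end of the file"
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Calling download_file(PARQUET_URL, Path("synthetic.bookings.dirty.parquet")) should save the file in the current working directory, and passing the returned path to looks_like_parquet gives a quick indication of whether the download produced a Parquet file rather than, for example, an HTML error page.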
Note
Note that storing data using the Parquet format is preferable to storing data using the CSV format. Once you need to work with much larger datasets, the difference...