Writing a Spark data pipeline
In this section, you will build a real data pipeline for gathering and processing datasets. The objective of the processing is to format, clean, and transform the data into a state that is usable for model training. Before writing the data pipeline, let's first understand the data.
Preparing the environment
In order to perform the following exercises, you first need to set up a couple of things: a PostgreSQL database to hold the historical flights data, and an S3 bucket in MinIO to which you will upload files. We use both a relational database and an S3 bucket to better demonstrate how to gather data from disparate data sources.
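As a rough illustration of the MinIO upload, the following sketch uses boto3, a Python S3 client. The endpoint URL, credentials, bucket name, and file name are all placeholder assumptions rather than values from this book, so substitute the ones from your own MinIO deployment:

```python
# Hypothetical example: upload a local data file to a MinIO bucket.
# The endpoint URL, credentials, bucket, and file names below are
# placeholders -- replace them with the values from your MinIO setup.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio-service:9000",  # assumed MinIO endpoint
    aws_access_key_id="minioadmin",            # assumed access key
    aws_secret_access_key="minioadmin",        # assumed secret key
)

# Assumes the bucket does not exist yet; create_bucket errors otherwise.
s3.create_bucket(Bucket="flights-data")        # hypothetical bucket name
s3.upload_file("airports.csv", "flights-data", "airports.csv")
```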
We have prepared a container image that you can run on your Kubernetes cluster. The image is available at https://quay.io/repository/ml-on-k8s/flights-data, and it runs a PostgreSQL database with flights data preloaded into a table called flights.
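To preview where the pipeline is headed, here is a minimal PySpark sketch of reading that preloaded table over JDBC. Only the flights table name comes from the image description above; the service host, port, database name, credentials, and driver version are assumptions that you should adjust for your own deployment:

```python
# Hypothetical sketch: read the preloaded "flights" table into a Spark
# DataFrame over JDBC. Host, port, database, and credentials are assumed
# values -- adjust them to match your PostgreSQL deployment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("flights-pipeline")
    # Assumes the PostgreSQL JDBC driver should be fetched at startup.
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0")
    .getOrCreate()
)

flights_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres-service:5432/postgres")  # assumed host/db
    .option("dbtable", "flights")          # table name from the image description
    .option("user", "postgres")            # assumed credentials
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Inspect the schema and a few rows to confirm the data loaded.
flights_df.printSchema()
flights_df.show(5)
```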
Go through the following steps...