The end-to-end pipeline
Before we write any code, we need to consider the following:
- Data is loaded daily, and our process runs after that load completes. Do we schedule our processing, or trigger it?
- We will need a way to specify the date we'll be processing (see the entry-point sketch after this list)
- We need to consider the possibility of having to reprocess a date if there is an issue with the original dataset
- Do we need to write a mechanism to process more than one day?
- Do we need to write a mechanism to reprocess the whole dataset?
- What data quality rules should be put in place?
- How and where are we going to transform our data?
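The following is a minimal sketch of what a date-driven entry point could look like, assuming the job is packaged as an object called `DailyPipeline` and launched with `spark-submit`. The object name, the argument names (`--date`, `--start-date`, `--end-date`), and the `processDate` placeholder are illustrative choices, not a fixed contract; the point is that a single date, an ad hoc reprocess, and a full backfill can all share one code path:

```scala
import java.time.LocalDate
import org.apache.spark.sql.SparkSession

object DailyPipeline {

  def main(args: Array[String]): Unit = {
    // Parse "--key value" pairs into a simple map (illustrative, not a full CLI parser).
    val params = args.sliding(2, 2).collect {
      case Array(key, value) => key.stripPrefix("--") -> value
    }.toMap

    // Either a single --date, or a --start-date/--end-date pair for reprocessing a range.
    val dates: Seq[LocalDate] =
      (params.get("date"), params.get("start-date"), params.get("end-date")) match {
        case (Some(d), _, _) =>
          Seq(LocalDate.parse(d))
        case (None, Some(s), Some(e)) =>
          val start = LocalDate.parse(s)
          val end   = LocalDate.parse(e)
          Iterator.iterate(start)(_.plusDays(1)).takeWhile(!_.isAfter(end)).toSeq
        case _ =>
          sys.error("Expected --date yyyy-MM-dd, or --start-date and --end-date")
      }

    val spark = SparkSession.builder()
      .appName("daily-pipeline")
      .getOrCreate()

    try {
      // Processing one date at a time keeps the daily run, a single-day
      // reprocess, and a whole-dataset backfill on the same code path.
      dates.foreach(processDate(spark, _))
    } finally {
      spark.stop()
    }
  }

  // Placeholder for the read-transform-write logic we build out later.
  def processDate(spark: SparkSession, date: LocalDate): Unit = {
    println(s"Processing partition for $date")
  }
}
```

A normal daily run would pass something like `--date 2023-06-01`, while a reprocess of a bad week or a full-history backfill would pass `--start-date` and `--end-date` instead, without any change to the processing logic.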
Here’s a high-level overview of how we can structure our Spark/Scala application to meet these requirements:
- Scheduling or triggering: We can use a scheduling tool such as ADF, Argo, Apache Airflow, or cron jobs to trigger our Spark application daily after data loading is complete. We’ll show an example in Argo in the Orchestrating our...