New York Yellow Taxi data, ETL pipeline, and deployment
The previous exercise showed how to refactor a legacy, less-than-ideal ETL implementation into a cleanly designed pipeline. However, the datasets we used were quite simple and not entirely representative of the data you will come across in practice. The exercise also lacked unit testing and validation, which limited how robust the resulting pipeline could be.
In this scenario, we’ll take things a step further and build a pipeline that is closer to what you might encounter in a professional setting. It will follow professional coding practices such as error handling, modularity for easy extension, and unit testing.
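To make those three practices concrete before we start, here is a minimal sketch that uses only the Python standard library. It is not the pipeline built in this chapter: the function names `extract` and `transform`, the logger name, and the `trip_distance` column used for illustration are all assumptions chosen for this example.

```python
import csv
import io
import logging

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("etl_sketch")


def extract(source):
    """Read all rows from a CSV file-like object into a list of dicts."""
    try:
        return list(csv.DictReader(source))
    except csv.Error as exc:
        # Error handling: log the failure, then re-raise so the caller can react.
        logger.error("Extraction failed: %s", exc)
        raise


def transform(rows):
    """Keep rows with a positive numeric trip_distance; skip malformed rows."""
    clean = []
    for row in rows:
        try:
            # Assumed column name; missing or non-numeric values are skipped.
            distance = float(row["trip_distance"])
        except (KeyError, TypeError, ValueError):
            logger.warning("Skipping malformed row: %r", row)
            continue
        if distance > 0:
            clean.append({**row, "trip_distance": distance})
    return clean


def test_transform_drops_bad_rows():
    """Unit test: non-numeric and non-positive distances are removed."""
    raw = io.StringIO("trip_distance\n1.5\nabc\n-2\n")
    assert transform(extract(raw)) == [{"trip_distance": 1.5}]
```

Because each step is a small function with a single job, a new step can be added or swapped without touching the others, and each one can be tested in isolation, for example by running the test function above with a test runner such as pytest.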
We will use the New York 2021 Yellow Taxi Trip Data (https://data.cityofnewyork.us/Transportation/2021-Yellow-Taxi-Trip-Data/m6nq-qud6), an openly available dataset that is significantly larger and more complex than the data in the previous example. It contains detailed...