Best Practices for ETL Pipelines
Up to this point in the book, we've covered various tools and methods for creating reliable, scalable, and maintainable ETL pipelines. We've also spent time on the concept of "garbage in, garbage out": data quality and integrity, for both the source data and the expected output, must be prioritized throughout pipeline design and implementation, or the pipeline fails to serve its purpose. However, we haven't yet spent significant time on some of the most common pitfalls to be aware of while building these pipelines.
In this chapter, we will discuss the importance of monitoring and logging every activity in each pipeline you build, and how error-handling and recovery mechanisms will save you hours of frustration when debugging and troubleshooting a deployed pipeline. To create effective logging, we first need to discuss which aspects of your pipelines need to be tracked...
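To make the idea concrete before we dive in, here is a minimal sketch of logging plus retry-based recovery around a single pipeline step. The names `run_step` and `extract` are hypothetical, introduced only for illustration; they are not part of any specific framework discussed in this book.

```python
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("etl.pipeline")


def run_step(name, func, *args, retries=3, backoff_seconds=2, **kwargs):
    """Run one pipeline step, logging each attempt and retrying on failure.

    A failed attempt is logged with its traceback; the final failure is
    re-raised so the caller can decide how to recover.
    """
    for attempt in range(1, retries + 1):
        logger.info("step=%s attempt=%d starting", name, attempt)
        try:
            result = func(*args, **kwargs)
        except Exception:
            logger.exception("step=%s attempt=%d failed", name, attempt)
            if attempt == retries:
                raise
            # Simple linear backoff between retries (hypothetical policy).
            time.sleep(backoff_seconds * attempt)
        else:
            logger.info("step=%s attempt=%d succeeded", name, attempt)
            return result


# A hypothetical extract step, used only to show the wrapper in action.
def extract():
    return [{"id": 1}, {"id": 2}]


rows = run_step("extract", extract)
```

Because every attempt of every step emits a structured log line, a deployed pipeline leaves a trail you can follow when something goes wrong, which is exactly the habit the rest of this chapter builds on.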