Summary
We began this chapter by outlining why data quality checks are essential in any data pipeline. We then introduced Deequ, the library developed by Amazon, and its components. Because Deequ is built on Spark, it inherits Spark's distributed processing. Finally, we took a deep dive into the functionality Deequ offers, including automatic constraint suggestion, constraint definition, and metrics repositories.
In the next chapter, we will look at code health and maintainability, along with test-driven development (TDD), which is vital for a scalable and easily maintainable code base.