Summary
In this chapter, we discussed the importance of data modeling when designing a new ETL use case: organizing and persisting the data so that subsequent data operations benefit from an optimal balance of performance, cost, efficiency, and quality.
A good data model speeds up queries and reduces the unnecessary I/O caused by expensive, wasted scans. A design-first approach forces us to think through the data relations; it not only helps reduce data redundancy but also improves the reuse of pre-computed results, thereby lowering storage and compute costs on big data platforms. More efficient data utilization improves the overall user experience. Stable base datasets ensure greater consistency in the derived datasets further down the pipeline, improving the quality of the generated insights.
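To make these points concrete, here is a minimal PySpark sketch (all paths, table layouts, and column names are hypothetical) of two of the ideas above: partitioning the base dataset so queries avoid wasted scans, and persisting a pre-computed aggregate so downstream consumers reuse it instead of re-scanning the raw events.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("modeling-sketch").getOrCreate()

# Hypothetical raw events ingested by an upstream ETL job.
events = spark.read.parquet("/data/raw/events")

# Persist the base dataset partitioned by event_date: a query filtering on a
# date range then reads only the matching partitions, not the whole table.
(events.write.mode("overwrite")
       .partitionBy("event_date")
       .parquet("/data/base/events"))

base = spark.read.parquet("/data/base/events")

# Pre-compute a daily aggregate once; every downstream report reads this
# small derived table instead of re-aggregating the raw events, cutting
# both compute and I/O.
daily = (base.groupBy("event_date", "product_id")
             .agg(F.count("*").alias("event_count")))
daily.write.mode("overwrite").parquet("/data/derived/daily_event_counts")

# A downstream query now touches only one partition's worth of derived data.
(spark.read.parquet("/data/derived/daily_event_counts")
      .where(F.col("event_date") == "2024-01-15")
      .show())
```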
In the next chapter, we will look at the Delta protocol and the main features that help bring reliability and performance to data lakes.