Summary
This chapter addressed important data delivery practices to consider as part of your data-engineered solution. At the outset, we stated that data consumers’ processing use cases always require data that is consistent, complete, and semantically correct. This chapter elaborated on that goal with best practices and examples of the choices you have to make as you implement your data solutions.
You were introduced to data streaming considerations and saw how bulk operations can be treated as streaming operations with smaller bulk (or micro-batch) sizes. This enables you to tune the entire system for the best performance given the technologies being used. Best practices for publishing and subscribing to data were outlined and then elaborated upon. Data flow was discussed, with a detailed study of Google’s implementation using Apache Beam.
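As a minimal illustration of the micro-batching idea, the sketch below (plain Python, with a hypothetical publish step and synthetic records not taken from this chapter) shows a bulk input being yielded as a stream of smaller micro-batches whose size can be tuned to trade throughput against latency.

```python
from typing import Dict, Iterable, Iterator, List

def micro_batches(records: Iterable[Dict], batch_size: int = 500) -> Iterator[List[Dict]]:
    """Yield a bulk input as a stream of smaller micro-batches."""
    batch: List[Dict] = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly smaller batch

def publish(batch: List[Dict]) -> None:
    # Placeholder for a downstream publish/process step (hypothetical).
    print(f"published {len(batch)} records")

# Tuning batch_size trades throughput against end-to-end latency:
# larger batches amortize per-call overhead, smaller batches deliver sooner.
for batch in micro_batches(({"id": i} for i in range(2_000)), batch_size=500):
    publish(batch)
```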
We also explored how best to organize huge volumes of data while making the output of the data factory available for...