Analyzing tradeoffs in a push versus pull data flow
A long, long time ago, we started with a data warehouse. As we discovered its inadequacies, we moved to a data lake. However, a vanilla data lake is no silver bullet, so folks would perform expensive ETL in a data lake and push curated, aggregated data slivers into a downstream warehouse for BI tools to pick up. Another architecture anti-pattern we've seen in the field is performing ETL in a warehouse and pushing data to a lake to do ML.

We have come a long way since then. Modern data lakes embrace the lakehouse paradigm, and BI tools can reach the data in a lake directly, bypassing the warehouse completely. We believe this pattern will continue to gain traction in the industry. So, is the warehouse dead? In spirit, yes, but in practice, it will take a few more years to phase out completely. So, when is it good to have any kind of specialized data store to the right of a data lake? If it can be avoided...
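The push-versus-pull contrast above can be sketched in miniature. The following is a hypothetical, simplified illustration (the file layout, function names, and JSON "lake" are all assumptions made for the example, not any specific product's API): the push pattern runs an ETL job that copies a curated aggregate into a separate "warehouse" store, while the pull (lakehouse) pattern computes the same answer by reading the lake in place, with no downstream copy to keep in sync.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Toy model: the "lake" is a directory of raw event files,
# the "warehouse" is a separate directory of curated slivers.

def write_raw_events(lake: Path) -> None:
    # Raw, unaggregated events land in the lake.
    events = [
        {"region": "us", "sales": 100},
        {"region": "us", "sales": 50},
        {"region": "eu", "sales": 75},
    ]
    (lake / "events.json").write_text(json.dumps(events))

def push_etl(lake: Path, warehouse: Path) -> None:
    # Push pattern: an ETL job curates an aggregated sliver and
    # copies it into the downstream warehouse for BI tools.
    totals: dict[str, int] = defaultdict(int)
    for f in lake.glob("*.json"):
        for event in json.loads(f.read_text()):
            totals[event["region"]] += event["sales"]
    (warehouse / "sales_by_region.json").write_text(json.dumps(totals))

def pull_query(lake: Path) -> dict:
    # Pull (lakehouse) pattern: the query layer reads the lake
    # directly, computing the same aggregate on demand -- no copy.
    totals: dict[str, int] = defaultdict(int)
    for f in lake.glob("*.json"):
        for event in json.loads(f.read_text()):
            totals[event["region"]] += event["sales"]
    return dict(totals)

with tempfile.TemporaryDirectory() as tmp:
    lake, warehouse = Path(tmp, "lake"), Path(tmp, "warehouse")
    lake.mkdir()
    warehouse.mkdir()
    write_raw_events(lake)
    push_etl(lake, warehouse)  # data copied downstream
    pushed = json.loads((warehouse / "sales_by_region.json").read_text())
    pulled = pull_query(lake)  # data read in place
    print(pushed == pulled)  # same answer; one copy versus zero copies
```

The two functions produce identical results; the difference is operational. Push creates a second copy that must be refreshed and reconciled whenever the lake changes, which is exactly the maintenance burden the lakehouse pattern avoids by letting the query engine pull from the lake directly.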