Pulling it all together
Let's review what we've discussed so far and see how we can use Oozie to build a series of workflows that, by combining all of these techniques, implement a data life cycle management approach.
First, it's important to define clear responsibilities and implement the parts of the system following good design and separation-of-concerns principles. Applying these principles, we end up with several distinct workflows:
- A subworkflow that ensures the environment (mainly HDFS directories and Hive metadata) is correctly configured
- A subworkflow that performs data validation
- The main workflow, which triggers both of the preceding subworkflows and then pulls new data through a multistep ingest pipeline (sketched after this list)
- A coordinator that executes the preceding workflows every 10 minutes (see the second sketch below)
- A second coordinator that ingests reference data used by the application pipeline
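As a rough illustration, the main workflow could look like the following Oozie definition. This is a minimal sketch under assumed names, not a definitive implementation: the application paths (`${envSetupWorkflowPath}`, `${validationWorkflowPath}`) and the `ingest.sh` script are hypothetical placeholders for your own components.

```xml
<!-- Minimal sketch of the main workflow: run the two subworkflows,
     then kick off the ingest pipeline. -->
<workflow-app xmlns="uri:oozie:workflow:0.5" name="ingest-main-wf">
    <start to="env-setup"/>

    <!-- Subworkflow 1: verify HDFS directories and Hive metadata -->
    <action name="env-setup">
        <sub-workflow>
            <app-path>${envSetupWorkflowPath}</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="validate-data"/>
        <error to="fail"/>
    </action>

    <!-- Subworkflow 2: validate the incoming data -->
    <action name="validate-data">
        <sub-workflow>
            <app-path>${validationWorkflowPath}</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="ingest"/>
        <error to="fail"/>
    </action>

    <!-- A shell action wrapping a hypothetical ingest.sh stands in
         for the real multistep ingest pipeline -->
    <action name="ingest">
        <shell xmlns="uri:oozie:shell-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>ingest.sh</exec>
            <file>${wf:appPath()}/ingest.sh#ingest.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Pipeline failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Keeping environment setup and validation in their own subworkflows means each can be tested, reused, and evolved independently of the ingest logic.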
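The 10-minute coordinator can then be as simple as the following sketch; `${mainWorkflowPath}`, `${startTime}`, and `${endTime}` are hypothetical properties supplied through the coordinator's job configuration.

```xml
<!-- Sketch of a coordinator that runs the main workflow every 10 minutes -->
<coordinator-app xmlns="uri:oozie:coordinator:0.4" name="ingest-coord"
                 frequency="${coord:minutes(10)}"
                 start="${startTime}" end="${endTime}" timezone="UTC">
    <action>
        <workflow>
            <app-path>${mainWorkflowPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```

The second coordinator, for reference data, would follow the same pattern with its own frequency and workflow path.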
We also define all our tables with Avro schemas and use them wherever possible to help manage schema evolution and changing data formats...
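For example, an Avro schema for one of these tables might look like the following; the record and field names are illustrative, not taken from the text. A field added later with a default value lets consumers using the new schema continue to read files written before the field existed, which is the kind of evolution this approach is meant to absorb.

```json
{
  "type": "record",
  "name": "Event",
  "namespace": "com.example.pipeline",
  "doc": "Hypothetical ingest record, used here to illustrate schema evolution",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "ts", "type": "long", "doc": "event timestamp, epoch millis"},
    {"name": "source", "type": "string"},
    {"name": "comment", "type": ["null", "string"], "default": null,
     "doc": "added in a later version; the default lets this schema read older files"}
  ]
}
```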