Summary
Hopefully, this chapter presented the topic of data life cycle management as something more than a dry, abstract concept. We covered a lot of ground, in particular:
The definition of data life cycle management and how it covers a number of issues and techniques that usually become important with large data volumes
The concept of building a data ingest pipeline according to good data life cycle management principles, so that it can then be used by higher-level analytic tools
Oozie as a Hadoop-focused workflow manager and how we can use it to compose a series of actions into a unified workflow
Various Oozie features, such as subworkflows, parallel action execution, and global variables, that allow us to apply sound design principles to our workflows (a brief sketch follows at the end of this list)
HCatalog and how it provides the means for tools other than Hive to read and write table-structured data; we showed both its promise and its integration with tools such as Pig, but also highlighted some of its current weaknesses
Avro as our tool of choice to handle schema evolution...
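As a quick illustration of those Oozie features, the following is a minimal workflow sketch, not taken from the chapter, that combines a global section, a subworkflow action, and a fork/join for parallel Pig actions. The workflow, script, and property names (ingest-wf, clean.pig, metadata.pig, jobTracker, nameNode, the subworkflow path) are hypothetical placeholders:

<workflow-app name="ingest-wf" xmlns="uri:oozie:workflow:0.4">

    <!-- Global settings inherited by every action (workflow schema 0.4+) -->
    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
    </global>

    <start to="ingest-subwf"/>

    <!-- Delegate the ingest steps to a reusable subworkflow (hypothetical path) -->
    <action name="ingest-subwf">
        <sub-workflow>
            <app-path>${nameNode}/workflows/ingest</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="process-fork"/>
        <error to="fail"/>
    </action>

    <!-- Run two independent Pig jobs in parallel -->
    <fork name="process-fork">
        <path start="clean-data"/>
        <path start="build-metadata"/>
    </fork>

    <action name="clean-data">
        <pig>
            <script>clean.pig</script>
        </pig>
        <ok to="process-join"/>
        <error to="fail"/>
    </action>

    <action name="build-metadata">
        <pig>
            <script>metadata.pig</script>
        </pig>
        <ok to="process-join"/>
        <error to="fail"/>
    </action>

    <join name="process-join" to="end"/>

    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>

    <end name="end"/>

</workflow-app>

Because the job tracker and name node are declared once in the global section, none of the individual actions need to repeat them, which keeps workflows of many actions easier to maintain.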