Summary
Hopefully, this chapter has presented the topic of data life cycle management as something other than a dry, abstract concept. In particular, we covered:
- The definition of data life cycle management and how it encompasses a range of issues and techniques that typically become important as data volumes grow
- The concept of building a data ingest pipeline according to good data life cycle management principles so that it can then be utilized by higher-level analytic tools
- Oozie as a Hadoop-focused workflow manager and how we can use it to compose a series of actions into a unified workflow
- Various Oozie features, such as subworkflows, parallel action execution, and global variables, that allow us to apply sound design principles to our workflows (a minimal sketch follows this list)
- HCatalog and how it provides the means for tools other than Hive to read and write table-structured data; we showed its great promise and its integration with tools such as Pig, but also highlighted some current weaknesses
- Avro as our tool of choice to handle schema evolution...
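As a quick reminder of how these Oozie features fit together, the following is a minimal sketch of a workflow definition, not an example from the chapter: the action names, script, and subworkflow path are hypothetical, and it assumes the 0.4 workflow schema or later, which supports the `<global>` section. A fork runs a Pig action and a subworkflow in parallel, and both inherit the global settings.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="ingest-wf">

    <!-- Global variables shared by every action in the workflow -->
    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
    </global>

    <start to="parallel-steps"/>

    <!-- Parallel action execution: both paths run concurrently -->
    <fork name="parallel-steps">
        <path start="clean-data"/>
        <path start="archive-data"/>
    </fork>

    <!-- A Pig action; hypothetical script name -->
    <action name="clean-data">
        <pig>
            <script>clean.pig</script>
        </pig>
        <ok to="joined"/>
        <error to="fail"/>
    </action>

    <!-- A subworkflow reusing an existing workflow definition; hypothetical path -->
    <action name="archive-data">
        <sub-workflow>
            <app-path>${nameNode}/workflows/archive</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="joined"/>
        <error to="fail"/>
    </action>

    <!-- Both parallel paths must complete before the workflow ends -->
    <join name="joined" to="end"/>

    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```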