Collecting additional data
Many data processing systems have more than one data ingest source; often, one primary source is enriched by other secondary sources. We will now look at how to incorporate the retrieval of such reference data into our data warehouse.
At a high level, the problem isn't very different from our retrieval of the raw tweet data: we wish to pull data from an external source, possibly do some processing on it, and store it where it can be used later. But this does highlight an aspect we need to consider: do we really want to retrieve this data every time we ingest new tweets? The answer is certainly no. The reference data changes very rarely, and we could easily fetch it much less frequently than new tweet data. This raises a question we've skirted until now: just how do we schedule Oozie workflows?
Scheduling workflows
Until now, we've run all our Oozie workflows on demand from the CLI. Oozie also has a scheduler that allows jobs to be started either on a...