Content registry
We have seen in this chapter that data ingestion is an area that is often overlooked, and that its importance should not be underestimated. At this point, we have a pipeline that enables us to ingest data from a source, schedule that ingest, and direct the data to our repository of choice. But the story does not end there. Now that we have the data, we need to fulfil our data management responsibilities. Enter the content registry.
We're going to build an index of metadata related to the data we have ingested. The data itself will still be directed to storage (HDFS, in our example) but, in addition, we will store metadata about the data, so that we can track what we've received and understand basic information about it, such as when we received it, where it came from, how big it is, what type it is, and so on.
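To make the idea concrete, here is a minimal sketch of what a single registry entry might look like. It is an illustration only: the names `ContentRecord` and `build_record` are hypothetical, the local filesystem stands in for HDFS, and the content type is simply guessed from the file extension rather than inspected.

```python
import mimetypes
import os
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class ContentRecord:
    """One registry entry describing a single ingested file (hypothetical schema)."""
    path: str           # where the data was stored
    source: str         # where the data came from
    received_at: str    # when we received it, as an ISO 8601 timestamp
    size_bytes: int     # how big it is
    content_type: str   # what type it is (guessed from the extension)


def build_record(path, source, received_at=None):
    """Collect basic metadata for a file that has already been written to storage.

    Assumption: the file is reachable through the local filesystem. For HDFS,
    the size lookup would need to go through an HDFS client instead.
    """
    received = received_at or datetime.now(timezone.utc)
    guessed, _ = mimetypes.guess_type(path)
    return ContentRecord(
        path=path,
        source=source,
        received_at=received.isoformat(),
        size_bytes=os.path.getsize(path),
        content_type=guessed or "application/octet-stream",
    )
```

Calling `asdict(build_record(path, source))` yields a plain dictionary, which is a convenient shape to hand to whichever metadata store is eventually chosen.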
Choices and more choices
The choice of which technology we use to store this metadata is, as we have seen, one based upon knowledge and experience. For metadata indexing...