Before we look at why metadata matters, let's understand what metadata is. Metadata is simply data about data. This sounds confusing because the definition appears to be recursive.
In a typical big data system, we have these three layers:
- Applications writing data to a big data system
- Organizing data within the big data system
- Applications consuming data from the big data system
This brings up a few challenges, as we are talking about millions (even billions) of data files and segments stored in the big data system. We should be able to correctly identify the ownership and usage of these data files across the enterprise.
Let's take the example of a TV broadcasting company that owns a TV channel; it creates television shows and broadcasts them to its target audience over wired cable networks, satellite networks, the internet, and so on. If we look carefully, there is only one source of content. But it travels through all possible mediums before finally reaching the user's location for viewing on TVs, mobile phones, tablets, and so on.
Since viewers access this TV content on a variety of devices, the applications running on these devices can generate several messages to indicate various user actions and preferences, and send them back to the application server. The volume of this data is very large, so it is stored in a big data system.
Depending on how the data is organized within the big data system, it can be almost impossible for outside applications or peer applications to know what types of data are stored within it. To make this easier, we need to describe and define how data is organized within the big data system. This description helps us better understand how the data is organized and accessed.
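To make the idea of describing a dataset concrete, here is a minimal sketch of what one metadata record might look like for the viewer messages in our example. The class name `DatasetMetadata`, its fields, the path, and the team name are all assumptions made for illustration; a real metadata system would define its own model.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DatasetMetadata:
    """Describes one dataset; it holds no viewer events itself."""
    name: str                  # logical dataset name
    location: str              # where the files live (hypothetical path)
    file_format: str           # for example "json" or "parquet"
    owner: str                 # team responsible for the data (hypothetical)
    partition_keys: list = field(default_factory=list)
    schema: dict = field(default_factory=dict)  # column name -> type name


# Hypothetical description of the messages sent back by viewer devices.
viewer_events = DatasetMetadata(
    name="viewer_events",
    location="/data/broadcast/viewer_events",
    file_format="json",
    owner="audience-analytics",
    partition_keys=["event_date", "device_type"],
    schema={
        "user_id": "string",
        "show_id": "string",
        "action": "string",
        "event_time": "timestamp",
    },
)

# Serialize the description so that other applications can read it
# without opening any of the underlying data files.
print(json.dumps(asdict(viewer_events), indent=2))
```

The record is data about data: an application can learn the format, layout, and owner of `viewer_events` from it without reading a single event.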
Let's extend this example further and say there is another application that reads from the big data system to determine the best times to advertise during a given TV series. This application needs a good understanding of all the other data that is available within the big data system. Without a well-defined metadata system, it's very difficult to do the following things:
- Understand the diversity of data that is stored, accessed, and processed
- Build interfaces across different types of datasets
- Correctly tag the data from a security perspective as highly sensitive or non-sensitive
- Connect the dots between the given sets of systems in the big data ecosystem
- Audit and troubleshoot issues that might arise because of data inconsistency
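As a rough illustration of how a metadata system supports the tasks above, the following sketch keeps a small in-memory catalog that records ownership, a sensitivity tag, and the upstream datasets each dataset is derived from. The names `CatalogEntry` and `MetadataCatalog`, the dataset names, and the `"high"`/`"low"` labels are assumptions for this example and do not refer to any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str
    sensitivity: str  # "high" or "low" (assumed labels)
    upstream: list = field(default_factory=list)  # datasets this one is derived from


class MetadataCatalog:
    """Toy in-memory catalog; a real one would be a persistent service."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def by_sensitivity(self, level):
        """List dataset names carrying the given sensitivity tag."""
        return sorted(e.name for e in self._entries.values()
                      if e.sensitivity == level)

    def lineage(self, name):
        """Return every dataset that `name` depends on, directly or indirectly."""
        seen = []
        stack = list(self._entries[name].upstream)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                stack.extend(self._entries[current].upstream)
        return sorted(seen)


catalog = MetadataCatalog()
catalog.register(CatalogEntry("viewer_events", "audience-analytics", "high"))
catalog.register(CatalogEntry("show_schedule", "programming", "low"))
catalog.register(CatalogEntry("hourly_viewership", "audience-analytics", "low",
                              upstream=["viewer_events", "show_schedule"]))
catalog.register(CatalogEntry("ad_slot_ranking", "ad-sales", "low",
                              upstream=["hourly_viewership"]))

# Security tagging: which datasets are highly sensitive?
print(catalog.by_sensitivity("high"))

# Connecting the dots: what does the advertising dataset ultimately depend on?
print(catalog.lineage("ad_slot_ranking"))
```

With entries like these, the advertising application from the example can discover which datasets feed its input, and an auditor investigating an inconsistency can walk the same lineage back to the raw viewer events.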