Master Data management
When we refer to Master Data, we are talking about the descriptive data that is at the core of an organization and the processes to ensure that we can understand when different units are referring to the same instance of a concept.
To many extents, Master Data in the data platform realm overlaps with the dimensions that describe the organization concepts, such as customer, product, employee, and so on.
The rare times when we have a Master Data dimension, it adds the semantic of containing the “golden records” selected to represent the instances of the business concept (that is, the entity) represented by the dimension.
As an example, the product MD dimension (MDD_PRODUCT
for us in the REF layer) contains the golden records of the products, starting from the codes used as the PK of the entity and continuing with the values of the columns.
Quite often, we will have only a list of MD codes, eventually with names, and mapping tables that allow us to transform local dimensions into MD codes and produce an MD-coded dimension using data from local dimensions, thanks to business rules.
When we talk about Master Data management, we are talking about the processes and data (MDD and MAP tables) for managing Master Data entities across domains.
Data for Master Data management
In Master Data management, we rely on two types of data:
- Master Data Dimensions (MDDs): This is the dimension that contains the “gold records” of the business concept. Often, there is not a full dimension but only a list of master data codes, maybe with a name or description for human use.
Its validity can be limited to a specific domain, such as US products only.
- Mapping tables (MAPs): This is a table that allows us to convert the codes from one domain into another – for example, converting Italian product codes into EMEA codes.
These mapping tables are at the heart of every MD initiative as they are the tools for making data interoperable.
They can be used to convert the codes of a dimension, but their power is to also convert the codes into facts, opening the use of facts across domains.
Mapping tables are generally expressed in terms of business keys and not surrogate keys since they have no business meaning.
A key element to be aware of for each mapping table is the directionality of mapping. It is quite common that the cardinality of a concept is different across domains, and therefore the mapping cannot be one-to-one and bidirectional.
If, in a local domain, multiple codes have been used to identify a single code in the Master Data dimension over time, then all these codes will map to the same MD code and the mapping will be many-to-one and unidirectional from the local domain to the MD domain.
This would be the case if a country has multiple codes to represent a “generic box 30x30x5 cm” product because they are sourcing them from different vendors or producing them in different plants, but the MDD has only one such product.
This also shows that a mapping table might use multiple columns to map one code across domains. Using the box as an example, we would need the MD code and the production plant code to map an MD code to the local domain.
It might be surprising to you, but the mapping tables are the ones that are needed, whereas having the master data dimension for all or even most concepts is not common. It is very welcome, but not a must.
Even when we just have the mapping table, we implicitly have also a minimal MDD that’s made by all the MD codes that are present in the mapping table. In such cases, the dimension used in data marts is derived from the local dimensions, along with some business rules to decide what data to collect from each local dimension and the eventual precedence when data can be alternatively sourced by multiple dimensions.
A light MDM approach with DBT
When we discussed modularity earlier in this chapter, we introduced the two possible extreme situations that we will find when making data interoperable across domains.
One possibility is to progressively convert the data from one domain to the next, while the other extreme is to convert everything directly to the higher-level domain, which is often considered the target master data system of reference.
Reality is hardly black and white, and you will find yourself in situations where, for some data, you have to go through a few conversion steps, while for others, you just have to take a single step. This mixed approach is often the case as some entities do not have the same semantics. Besides this, with master data code conversions, some rules are needed to have the same definition and build a more widely applicable version of the concept.
Let’s make a simple graphical example to better clarify one step of the MD conversion process. In our example, we have two systems that have Order LINES data to be converted into some master data domain to make it interoperable:
Figure 13.3: Desired result of the master data conversion – interoperable Order LINES from SYS1 and SYS2
Here, we start our conversion from data that is encoded with SYS1
and SYS2
keys (business or surrogate keys, which will be discussed next), making them not interoperable. Our goal is to produce one model coded in the desired MD domain and then be able to combine them to provide a domain-level model.
When we start looking into the Order Lines from system 1, we can see that they do not have business keys, only surrogate keys (SK
).
The problem is that SKs are only valid inside one specific environment, so the SK for the same instance is different in DEV, QA, and PROD and it will change if one environment or a dimension is recreated:
Figure 13.4: Transformations to apply MD when the source data uses surrogate keys
Because of their nature, SKs do not have any meaning for anyone and could change at any time, so they are not used in master data systems.
The master data system uses the BKs of the entities as they are meaningful to people and stable over time to identify instances of the entity.
To retrieve the BK of an entity, we can join its dimension using the SK provided.
Therefore, our first action is to replace the surrogate keys that we have in our source with the corresponding business keys.
In the preceding figure, we joined our source data in CURR_SYS1_OLINES
with the dimensions of system one (TR_SYS1_CUSTOMER
and TR_SYS1_PRODUCT
) to produce the TR_SYS1_OLINES
model that has the SYS1 BKs in place of SKs.
The next step is to join the TR_SYS1_OLINES
model with the mapping table for each BK that we have to convert from SYS1 into the desired MD domain to produce the REF_SYS1_OLINES
model, which has the MD domain BKs in place.
In this example, we have assumed that we only had customer and product keys that needed to be converted, first from SK to SYS1 BK and then to MD domain BK. It might well be that you have a different number of SKs to convert. As an example, consider that you have a country key. First, you would need to convert from SK to SYS1 BK, but if SYS1 already uses international country codes, then you might already be using the desired BKs, which means you do not need a further MD conversion for the country BK.
We also want you to note how subjective and depending on the expected usage of our models the naming of them can be.
We have used the TR_
prefix for the dimensions because they are expressed in terms of the SKs and therefore are not of general use outside SYS1. Then, we named TR_
the Order Lines model. This is where we put the SYS1 BKs as we only plan to build reporting using the domain codes, so we don’t consider it reusable. For the same reason, we named REF_
the next model that will contain the SYS1 data with DOMAIN BK.
In a way, our assumption to always do reports using MD domain BKs made us call REF_
the only model that uses such a code reference system, so we do not need to be explicit about in what domain it is a REF model.
If we had plans to produce reporting data marts that use SYS1 BKs, then we should name the TR_SYS1_OLINES
model REF_SYS1_OLINES
, and the next one that contains the same data but uses DOMAIN BKs REF_DOMxxx__SYS1_OLINES
, to make clear what BKs are used in each model and where the data is coming from.
This last convention is a bit verbose but is more clear overall. Please be aware of your assumptions and make them public when you decide how to name your models.
Now that we’ve converted system one, let’s look at what we need to do for system two:
Figure 13.5: Transformations to apply MD when the source data uses business keys
In our example, we have assumed that system two is already using its own BK, so we do not need a first step to replace the internal SK with the system BKs.
In this case, we just have to apply the MD part of the transformation to convert the SYS2 BKs into the MD domain BKs.
We have named one of the mapping tables MAP_SYS2_ITEM_TO_DOMAIN_PROD
so that we can when the concepts in the two domains are not aligned.
In our example, we assumed that, in SYS2, we have the concept of ITEM
, which has to be converted into a PRODUCT
in the MD domain. The difference and the business rules needed to do this conversion can be whatever, so we haven’t made any special assumptions. However, in such situations, a lookup on the mapping table might not be enough and you will need to implement more complex business rules and eventually use intermediate models to simplify the computation and make it clear for other developers what transformations are being done and why.
Once you have transformed the data from both source systems so that we can use the desired MD domain BK, you can put them together by selecting the common fields and applying eventual business rules to produce the columns that are not immediately available in all the inputs. In some cases, you might add intermediate steps to perform this adaptation.
One simple example is that some inputs might already have amounts in the local currency and the desired accounting currency, while others might need one extra transformation to calculate the amounts in the accounting currency from the ones in the local currency. Here, we are back to normal business rules and not something specific to MD conversions.
In this section, we discussed how to apply Master Data transformations to make data from different systems interoperable. Having data interoperability across sources is by far the most important goal that we have when we build a data platform that has as input a plurality of data sources that use different BKs for some of the data.
In the next section, we will look at an advanced way to store the input data that we use in all our projects running on cloud databases such as Snowflake or BigQuery.