Reference data and master data are two common data categories that an architect would typically come across in most projects. In addition to these, in this section, you will also discover the characteristics of transactional data, reporting data, metadata, big data, and unstructured data. Having a deeper understanding of these different data categories will help you craft your overall data strategy, including data governance. It will also help you speak the same language your data architects prefer to use. Take a closer look at each of them.
Transactional Data
Transactional data is generated by regular business transactions. It describes business events. Normally, it is the most frequently changing data in the enterprise. Transactional data events could include the following:
- Products sold to customers
- Collected payments
- Created quotes
- Items shipped to customers
Transactional data is normally generated and managed by operational systems, such as CRM, ERP, and HR applications.
Master Data and Master Data Management
Enterprises normally maintain key business information that supports daily transactions. Such data normally describes customers, products, locations, and so on.
Such data is called master data, and it is commonly grouped into parties (employees, customers, suppliers, and so on), places (sites, regions, and so on), and things (products, assets, vehicles, and so on).
Business operations create and use master data in the normal course of business processes. However, operational applications are usually designed around an application-specific use of the master data. This can conflict with the enterprise-wide requirement for high-quality, commonly used master data, and can result in the following:
- Master data being low quality
- Duplicated and scattered data
- Lack of truly managed data
Master data management (MDM) is a concept widely used to describe the discipline where IT and business work together to ensure the accuracy and uniformity of the enterprise master data using specific tools and technologies. Maintaining a single version of the truth is one of the highest-priority topics on the agenda of most organizations.
A master data management tool is used to detect and remove duplicates, mass maintain data, and incorporate rules to prevent incorrect data from being entered.
There are different MDM implementation styles, such as the registry, consolidation, and coexistence styles. These styles are the foundation that MDM tools are based on. The business type, its data management strategy, and its situation will largely impact which style is selected.
The main difference between these implementation styles is in the way they deal with data, as well as the role of the MDM tool itself (whether it acts as a hub that controls the data or simply synchronizes data with other data stores). Here are three common styles in use:
- Registry style: This style spots duplicates in various connected systems by running matching and cleansing algorithms and assigning unique global identifiers to matched records. This helps identify the related records and build a comprehensive 360-degree view of the given data asset across the systems.
In this approach, data is not sent back to the source systems. Changes to master data continue to take place in the source systems, and the MDM tool assumes that those systems can manage their own data quality. When a 360-degree view of a particular data asset is required (for example, a customer), the MDM tool reads from each source system and builds the view in real time using the unique global identifier. A continuous process is normally needed to ensure the unique global identifier is still valid for a given dataset.
- Consolidation style: In this style, the data is normally gathered from multiple sources and consolidated in a hub to create a single version of the truth. This is sometimes referred to as the golden record, which is stored centrally in the hub and eventually used for reporting or as a reference. Any updates that are made to the golden record are then pushed and applied to the original sources. The consolidation style normally follows these steps:
- Identify identical or similar records for a given object. This can use an exact match or a fuzzy match algorithm.
- Determine records that will be automatically consolidated.
- Determine records that will require review by the data steward before they can be consolidated.
During these processes, the tool might use configured field weights. Fields with higher weights are normally considered more important factors when it comes to determining the attributes that would eventually form the golden record.
- Coexistence style: This style is similar to the consolidation style in the sense that it creates a golden record. However, the master data changes can take place either in the MDM hub or in the data source systems. This approach is normally more expensive to apply than the consolidation style due to the complexity of syncing data both ways.
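The consolidation steps described above (find similar records, consolidate clear matches automatically, and route borderline ones to a data steward) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the weights, and the two thresholds are hypothetical, and real MDM tools use their own, usually configurable, scoring rules.

```python
from difflib import SequenceMatcher

# Hypothetical field weights: a higher weight means the field counts more.
FIELD_WEIGHTS = {"email": 0.5, "name": 0.3, "city": 0.2}
AUTO_CONSOLIDATE_THRESHOLD = 0.90  # assumed: at or above, merge automatically
REVIEW_THRESHOLD = 0.70            # assumed: between thresholds, ask the steward


def field_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio for two field values, ignoring case."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of the per-field similarities."""
    return sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in FIELD_WEIGHTS.items()
    )


def classify(rec_a: dict, rec_b: dict) -> str:
    """Decide what to do with a candidate pair of records."""
    score = match_score(rec_a, rec_b)
    if score >= AUTO_CONSOLIDATE_THRESHOLD:
        return "auto_consolidate"
    if score >= REVIEW_THRESHOLD:
        return "steward_review"
    return "no_match"


crm_record = {"email": "amy@example.com", "name": "Amy Wright", "city": "Lima"}
erp_record = {"email": "amy@example.com", "name": "Amy Wright", "city": "Oslo"}
print(classify(crm_record, erp_record))
```

In this sketch, the two records agree on the heavily weighted fields but differ on the city, so the pair lands between the two thresholds and is routed to the data steward rather than merged automatically.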
It is also good to understand how the matching logic normally works. There are two main matching mechanisms:
- Fuzzy matching: This is the more commonly used mechanism but is slower to execute due to the effort required to identify the match. Fuzzy matching uses a probabilistic determination based on possible variations in data patterns, such as transpositions, omissions, truncation, misspellings, and phonetic variations.
- Exact matching: This mechanism is faster because it only checks whether the field values on a record are identical to the corresponding field values on the target records.
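To make the difference between the two mechanisms concrete, here is a minimal Python sketch. It uses the standard library's difflib.SequenceMatcher as a stand-in for a fuzzy algorithm; commercial MDM tools typically rely on more sophisticated probabilistic and phonetic techniques, and the 0.85 threshold is an assumption chosen purely for illustration.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase the value and collapse repeated whitespace."""
    return " ".join(value.lower().split())


def exact_match(a: str, b: str) -> bool:
    """True only when the normalized values are identical."""
    return normalize(a) == normalize(b)


def fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when the similarity ratio reaches the (assumed) threshold."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


# A misspelling defeats the exact match but not the fuzzy one.
print(exact_match("Jonathan Smith", "Jonathon Smith"))
print(fuzzy_match("Jonathan Smith", "Jonathon Smith"))
```

The exact match is a single comparison per field, while the fuzzy match has to compute a similarity score for every candidate pair, which is why it is slower on large datasets.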
As a Salesforce Architect, you need to understand the capabilities and limitations of the out-of-the-box tools available in the platform, in addition to the capabilities that are delivered by key MDM players in the market—particularly those that have direct integration with Salesforce via an AppExchange application.
In your review board presentation, as well as in real life, you are expected to guide your stakeholders and explain the different options they have. It is not enough to say that you need an MDM tool; you should be specific and suggest a suitable product by name. This applies to all third-party products you may suggest. You may also be asked to explain how the MDM tool will be used to solve a particular challenge, so you need to understand the MDM implementation style that your proposed tool adopts.
Reference Data
Think of data such as order status (created, approved, delivered, and so on) or a list of country names with their ISO codes. You can also think of a business account type such as silver, gold, platinum, and so on. All of these are examples of reference data.
Reference data is typically static or slowly changing data that is used to categorize or classify other data. Reference datasets are sometimes referred to as lookup data. Some of this reference data can be universal (such as the countries with ISO codes, as mentioned earlier), while others might be domain-specific or enterprise-specific.
Reference data is different from master data. They both provide context to business processes and transactions. However, reference data is mostly concerned with categorization and classification, while master data is mainly related to business entities (for example, customers).
Reporting Data
Data organized in a specific way to facilitate reporting and business intelligence is referred to as reporting data. Data used for operational reporting either in an aggregated or non-aggregated fashion belongs in this category.
Reporting data is created by combining master data, reference data, and transactional data.
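The following Python sketch shows one way this combination might look. All of the records, identifiers, and tier codes are made up for illustration; in practice, this kind of aggregation would normally happen in a reporting database or BI tool rather than in application code.

```python
from collections import defaultdict

# Master data: business entities (hypothetical customers).
customers = {
    "C001": {"name": "Acme Ltd", "tier_code": "GLD"},
    "C002": {"name": "Globex", "tier_code": "SLV"},
}

# Reference data: a lookup list used to classify other data.
account_tiers = {"SLV": "Silver", "GLD": "Gold", "PLT": "Platinum"}

# Transactional data: business events (hypothetical orders).
orders = [
    {"customer_id": "C001", "amount": 120.0},
    {"customer_id": "C002", "amount": 80.0},
    {"customer_id": "C001", "amount": 50.0},
]


def revenue_by_tier(orders, customers, account_tiers):
    """Aggregate order amounts per account tier name."""
    totals = defaultdict(float)
    for order in orders:
        tier_code = customers[order["customer_id"]]["tier_code"]
        totals[account_tiers[tier_code]] += order["amount"]
    return dict(totals)


print(revenue_by_tier(orders, customers, account_tiers))
# prints {'Gold': 170.0, 'Silver': 80.0}
```

The resulting dictionary is a small piece of reporting data: it exists only because the three other categories were joined and aggregated.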
Metadata
Metadata has a very cool definition; it is known as the data that describes other data. eXtensible Markup Language (XML) is a commonly used metadata format.
A simple example is the properties of a computer file, that is, its size, type, author, and creation date. Salesforce utilizes metadata heavily since most of its features can be described using well-structured metadata. Custom objects, workflows, report structures, validation rules, page layouts, and many more features can all be extracted from the Salesforce Platform as metadata. This enables automated merge and deployment processes, something that you will cover in the chapters to come.
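The file properties example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the describe_file helper is a hypothetical name, and because the author and creation date are not available portably through os.stat, the last-modified time is used here instead.

```python
import mimetypes
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path: str) -> dict:
    """Return metadata about a file, not the file's content."""
    info = os.stat(path)
    mime_type, _ = mimetypes.guess_type(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "type": mime_type,  # guessed from the file extension, may be None
        "modified_utc": modified.isoformat(),
    }


# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello")
    sample_path = handle.name

print(describe_file(sample_path))
os.remove(sample_path)
```

The content of the file is the word hello, while everything the function returns is data about that data.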
Big Data
Big data refers to datasets that are too massive to be handled by traditional databases or data processing applications; that is, datasets containing hundreds of millions (even billions) of rows. Big data became popular in the past decade, especially with the decreased cost of data storage and increased processing capacity.
Businesses are increasingly understanding the importance of their data and the significant benefits they can gain from it. They do not want to throw away their data and want to make use of it for many purposes, including AI-driven decisions and personalized customer-centric services and journeys. It is worth mentioning that there have been some critics of the big data approach. Some describe it as simply dumping the data somewhere in the hope that it will prove to be useful someday. However, it is also clear that when big data is used in the right way, it can prove to be a differentiator for a particular business. The future seems promising for big data, which is a key ingredient for machine learning algorithms.
As an architect, you should be able to guide the client in combining the technologies required to manage a massive quantity and variety of data (likely in a non-relational database) with those that manage relational data (to handle complex business logic).
Unstructured Data
Unstructured data became more prevalent with the internet boom. This data does not have a predefined structure and therefore does not fit neatly into a structured RDBMS. In most cases, this type of data is text; for example, a PDF file, where text mining can help extract structure and relevant data from the unstructured document.
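As a simple illustration of text mining, the Python sketch below pulls a few structured fields out of free text using regular expressions. It assumes the text has already been extracted from the source document (for example, from a PDF), and the invoice number format and sample sentence are hypothetical. Real text mining usually involves far more robust techniques than pattern matching.

```python
import re

# Hypothetical patterns for the fields of interest.
INVOICE_RE = re.compile(r"\bINV-\d{4,}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
AMOUNT_RE = re.compile(r"(?:USD|\$)\s?(\d+(?:,\d{3})*(?:\.\d{2})?)")


def extract_fields(text: str) -> dict:
    """Turn free text into a small structured record."""
    amounts = [float(value.replace(",", "")) for value in AMOUNT_RE.findall(text)]
    return {
        "invoice_ids": INVOICE_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
        "amounts": amounts,
    }


note = (
    "Hi team, invoice INV-20417 for $1,250.00 is overdue. "
    "Please contact billing@example.com."
)
print(extract_fields(note))
```

Once the fields are extracted, the result can be stored in a structured database alongside the original document.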
The rise of big data and unstructured data takes us straight to our next topic—data warehousing and data lakes.