Building the data warehouse
A data warehouse is a database built for analysis and reporting. In other words, a data warehouse is a database whose only data entry point is ETL and whose primary purpose is to cover reporting and data analysis requirements. This definition clarifies that a data warehouse is not like the transactional databases that operational systems write data into. Because no operational system works directly with the data warehouse, and because the main purpose of this database is reporting, its design will be different from that of transactional databases.
If you recall the database normalization concepts, the main purpose of normalization is to reduce redundancy and dependency. The following table shows customers' data with their geographical information:
Let's elaborate on this example. As you can see from the preceding list, the geographical information in the records is redundant. This redundancy makes it difficult to apply changes. For example, in this structure, if Remuera, for any reason, is no longer part of the city of Auckland, then the change has to be applied to every record that has Remuera as its suburb. The following screenshot shows the tables of geographical information:
So, a normalized approach is to move the geographical information out of the customer table and put it into another table. The customer table then holds only a key pointing to that table. In this way, every time the value Remuera changes, only one record in the geographical table changes and the key remains unchanged. So, you can see that normalization is highly efficient in transactional systems.
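As a minimal sketch of what this normalized design could look like (the table and column names here are illustrative, not taken from any particular source system), the geographical attributes move into their own table and the customer table keeps only a key:

```sql
-- A hypothetical normalized design: geography is stored once and referenced by key
CREATE TABLE GeoRegion (
    GeoRegionID INT IDENTITY(1,1) PRIMARY KEY,
    Suburb      NVARCHAR(50),
    City        NVARCHAR(50),
    State       NVARCHAR(50),
    Country     NVARCHAR(50)
);

CREATE TABLE Customer (
    CustomerID  INT IDENTITY(1,1) PRIMARY KEY,
    FirstName   NVARCHAR(50),
    LastName    NVARCHAR(50),
    GeoRegionID INT FOREIGN KEY REFERENCES GeoRegion (GeoRegionID)
);

-- Renaming Remuera now touches a single row instead of every customer record
UPDATE GeoRegion SET Suburb = 'New Suburb Name' WHERE Suburb = 'Remuera';
```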
This normalization approach is not that effective for analytical databases. If you consider a sales database with many tables related to each other and normalized at least up to the third normal form (3NF), then analytical queries on such a database may require more than 10 join conditions, which slows down the query response. In other words, from the point of view of reporting, it is better to denormalize and flatten data in order to make it as easy to query as possible. This means the first design in the preceding table might be better for reporting.
However, query and reporting requirements are not that simple, and the business domains in the database are not as small as two or three tables. Real-world problems are therefore solved with a special design method for the data warehouse called dimensional modeling. There are two well-known methods for designing the data warehouse: the Kimball and Inmon methodologies.
The Inmon and Kimball methods are named after their creators. Both of these methods are in use nowadays. The main difference between them is that Inmon is top-down and Kimball is bottom-up. In this chapter, we will explain the Kimball method. You can read more about the Inmon methodology in Building the Data Warehouse, William H. Inmon, Wiley (http://www.amazon.com/Building-Data-Warehouse-W-Inmon/dp/0764599445), and about the Kimball methodology in The Data Warehouse Toolkit, Ralph Kimball, Wiley (http://www.amazon.com/The-Data-Warehouse-Toolkit-Dimensional/dp/0471200247). Both of these books are must-reads for BI and DW professionals and are recommended references for the bookshelf of every BI team. This chapter is based on The Data Warehouse Toolkit, so for a detailed discussion, read the referenced book.
To gain an understanding of data warehouse design and dimensional modeling, it's better to learn about the components and terminologies of a DW. A DW consists of Fact tables and dimensions. The relationship between a Fact table and the dimensions is based on foreign keys and primary keys (the primary key of the dimension table is referenced in the Fact table as a foreign key).
Facts are numeric and additive values in the business process. For example, in the sales business, a fact can be a sales amount, discount amount, or quantity of items sold. All of these measures or facts are numeric values and they are additive. Additive means that adding the values of several records together produces a meaningful result. For example, adding the sales amount of all records gives the grand total of sales.
Dimension tables are tables that contain descriptive information. Descriptive information, for example, can be a customer's name, job title, company, and even geographical information of where the customer lives. Each dimension table contains a list of columns, and the columns of the dimension table are called attributes. Each attribute contains some descriptive information, and attributes that are related to each other will be placed in a dimension. For example, the customer dimension would contain the attributes listed earlier.
Each dimension has a primary key, which is called the surrogate key. The surrogate key is usually an auto-incrementing integer value. The primary key of the source system will be stored in the dimension table as the business key.
The Fact table is a table that contains a list of related facts and measures with foreign keys pointing to surrogate keys of the dimension tables. Fact tables usually store a large number of records, and most of the data warehouse space is filled by them (around 80 percent).
Grain is one of the most important terminologies used in designing a data warehouse. Grain defines the level of detail stored in the Fact table. For example, you could build a data warehouse for sales in which the Grain is the most detailed level of transactions in the retail shop, that is, one record per transaction for a specific date and time, customer, and salesperson. Understanding Grain is important because it defines which dimensions are required.
There are two different schemas for creating the relationship between facts and dimensions: the snowflake and star schemas. In the star schema, a Fact table sits at the center as a hub, and dimensions connect to the fact through a single-level relationship. Ideally, there won't be a dimension that relates to the fact through another dimension. The following diagram shows the different schemas:
The snowflake schema, as you can see in the preceding diagram, contains dimensions that relate to the Fact table through intermediate dimensions. If you look more carefully at the snowflake schema, you may find it similar to the normalized form, and the truth is that a fully snowflaked design of the fact and dimensions will be in 3NF.
The snowflake schema requires more joins to answer an analytical query, so it responds more slowly. Hence, the star schema is the preferred design for the data warehouse. In practice, you often cannot build a pure star schema and will sometimes be required to do a level of snowflaking. However, the best practice is to avoid snowflaking as much as possible.
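The following query sketch illustrates the difference in the number of joins; it assumes a fact and product dimension similar to the ones designed later in this chapter, while DimProductSubcategory and DimProductCategory are hypothetical names used only for this comparison:

```sql
-- Star schema: one hop from the fact to the product dimension
SELECT p.Category, SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
INNER JOIN DimProduct AS p ON f.ProductKey = p.ProductKey
GROUP BY p.Category;

-- Snowflaked variant: the same question needs extra hops through
-- hypothetical DimProductSubcategory and DimProductCategory tables
SELECT c.CategoryName, SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
INNER JOIN DimProduct            AS p ON f.ProductKey     = p.ProductKey
INNER JOIN DimProductSubcategory AS s ON p.SubcategoryKey = s.SubcategoryKey
INNER JOIN DimProductCategory    AS c ON s.CategoryKey    = c.CategoryKey
GROUP BY c.CategoryName;
```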
An example of Internet sales
After a quick definition of the most common terminologies in dimensional modeling, it's now time to start designing a small data warehouse. One of the best ways of learning a concept or method is to see how it is applied to a sample scenario.
Assume that you want to build a data warehouse for the sales part of a business that runs a chain of supermarkets; each supermarket sells a list of products to customers, and the transactional data is stored in an operational system. Our mission is to build a data warehouse that makes it possible to analyze the sales information.
Before thinking about the design of the data warehouse, the very first question is: what is the goal of designing the data warehouse? What kind of analytical reports will be required as the result of the BI system? Answering these questions is the first and also the most important step. This step not only clarifies the scope of the work but also gives you a clue about the Grain.
Defining the goal can also be called requirement analysis. Your job as a data warehouse designer is to analyze the required reports, KPIs, and dashboards. Let's assume that the decision maker of this supermarket chain wants analytical reports such as a comparison of sales between stores, the top 10 customers, the top 10 bestselling products, or a comparison of sales on weekdays versus weekends.
After requirement analysis, the dimensional modeling phases will start. Based on Kimball's best practices, dimensional modeling can be done in the following four steps:
1. Choosing the business process.
2. Identifying the Grain.
3. Designing the dimensions.
4. Designing the facts.
In our example, there is only one business process, that is, sales. Grain, as we've described earlier, is the level of detail that will be stored in the Fact table. Based on the requirement, Grain is to have one record per sales transaction and date, per customer, per product, and per store.
Once the Grain is defined, it is easy to identify the dimensions. Based on the Grain, the dimensions would be date, store, customer, and product. It is useful to name dimensions with a Dim prefix to identify them easily in the list of tables. So our dimensions will be DimCustomer, DimProduct, DimDate, and DimStore. The next step is to identify the Fact table, which would be a single Fact table named FactSales. This table will store the defined Grain.
After identifying the Fact and dimension tables, it's time to go into more detail about each table and think about the attributes of the dimensions and the measures of the Fact table. Next, we will get into the details of the Fact table and then into each dimension.
There is only one Grain for this business process, and this means that one Fact table would be required. Based on the provided Grain, a Fact table would be connected to DimCustomer, DimDate, DimProduct, and DimStore. To connect to each dimension, there would be a foreign key in the Fact table that points to the primary key of the dimension table.
The table would also contain measures or facts. For the sales business process, facts that can be measured (numeric and additive) are SalesAmount, DiscountAmount, and QuantitySold. The Fact table would only contain relationships to other dimensions and measures. The following diagram shows some columns of the FactSales:
As you can see, the preceding diagram shows a star schema. We will go through the dimensions in the next step to explore them in more detail. Fact tables usually don't have many columns because the number of measures and related tables won't be that large. However, Fact tables contain many records. The Fact table in our example will store one record per transaction.
As the Fact table will contain millions of records, you should think about the design of this table carefully. String data types are not recommended in the Fact table because they don't add any numeric or additive value to the table. The relationship between the Fact table and a dimension is based on the surrogate key of the dimension. The best practice is to use the integer data type for surrogate keys; this is cost-effective in terms of the disk space required in the Fact table because an integer takes only 4 bytes while a string takes much more. Using an integer as the surrogate key also speeds up the join between the fact and the dimension because joins and filter criteria operate on integers, which is much faster than working with strings.
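As a rough T-SQL sketch of such a Fact table (assuming the dimension tables already exist; the StoreKey name and the data types are assumptions for this example), the design could look like the following:

```sql
-- A sketch of the FactSales table: integer surrogate keys pointing to the
-- dimensions, plus the numeric, additive measures
CREATE TABLE FactSales (
    CustomerKey    INT NOT NULL FOREIGN KEY REFERENCES DimCustomer (CustomerKey),
    DateKey        INT NOT NULL FOREIGN KEY REFERENCES DimDate (DateKey),
    ProductKey     INT NOT NULL FOREIGN KEY REFERENCES DimProduct (ProductKey),
    StoreKey       INT NOT NULL FOREIGN KEY REFERENCES DimStore (StoreKey),
    SalesAmount    MONEY NOT NULL,
    DiscountAmount MONEY NOT NULL,
    QuantitySold   INT NOT NULL
);
```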
If you are thinking about adding comments made by a salesperson about the sales transaction as another column of the Fact table, first think about the analysis you want to do based on those comments. No one does analysis directly on a free text field; if you wish to analyze free text, you can categorize the text values through the ETL process and build another dimension for that. Then, add the foreign key-primary key relationship between that dimension and the Fact table.
The customer's information, such as the customer name, customer job, customer city, and so on, will be stored in this dimension. You may think that the customer's city belongs in another dimension, such as a Geo dimension. But the important note is that our goal in dimensional modeling is not normalization. So resist the tendency to normalize tables. For a data warehouse, it is much better to store more customer-related attributes in the customer dimension itself rather than designing a snowflake schema. The following diagram shows sample columns of the DimCustomer table:
The DimCustomer dimension may contain many more attributes. The number of attributes in your dimensions is usually high. In fact, a dimension table with many attributes is what gives your data warehouse its power, because attributes will be your filter criteria in the analysis, and the user can slice and dice data by attributes. So, it is good to think about all possible attributes for a dimension and add them in this step.
As we've discussed earlier, you see attributes such as Suburb, City, State, and Country inside the customer dimension. This is not a normalized design, and it is definitely not a good design for a transactional database because it adds redundancy and makes it hard to keep changes consistent. However, for the data warehouse design, not only is the redundancy unimportant, it also speeds up analytical queries and prevents snowflaking.
You can also see two keys for this dimension: CustomerKey and CustomerAlternateKey. CustomerKey is the surrogate key and the primary key of the dimension in the data warehouse. CustomerKey is an integer field, which is auto-incremented. It is important that the surrogate key is not an encoded or string key; if something is coded somewhere, then it should be decoded and stored in the relevant attributes. The surrogate key should also be different from the primary key of the table in the source system. There are multiple reasons for that; for example, operational systems sometimes recycle their primary keys, which means they reuse the key value of a customer that is no longer active for a new customer.
CustomerAlternateKey is the primary key of the source system. It is important to keep the primary key of the source system stored in the dimension because it is necessary for identifying changes from the source table and applying them to the dimension. The primary key of the source system is called the business key or alternate key.
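A minimal T-SQL sketch of DimCustomer, assuming an illustrative set of attributes and data types, could therefore look like this:

```sql
-- A sketch of DimCustomer: an auto-incrementing surrogate key, the business
-- key from the source system, and denormalized geographical attributes
CREATE TABLE DimCustomer (
    CustomerKey          INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key
    CustomerAlternateKey NVARCHAR(20) NOT NULL,         -- business key from the source system
    FirstName            NVARCHAR(50),
    LastName             NVARCHAR(50),
    JobTitle             NVARCHAR(50),
    Suburb               NVARCHAR(50),
    City                 NVARCHAR(50),
    State                NVARCHAR(50),
    Country              NVARCHAR(50)
);
```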
The date dimension is one of the dimensions that you will find in most business processes. There are only rare situations where you work with a Fact table that doesn't store date-related information. DimDate contains many generic columns such as FullDate, Month, Year, Quarter, and MonthName. It is true that you can derive all the other columns from the full date column with some date functions, but that adds extra processing time at query time. So, at the time of designing dimensions, don't worry about space and add as many attributes as required. The following diagram shows sample columns of the date dimension:
It is useful to store holidays, weekdays, and weekends in the date dimension because a holiday or weekend definitely affects the sales transactions and amounts, and the user will want to understand why sales are higher on a specific date than on other days. In this example, you may also add another attribute for promotions, which states whether that specific date is a promotion date or not.
The date dimension will have a record for each date. The table, shown in the following screenshot, shows sample records of the date dimension:
As you can see in the records illustrated in the preceding screenshot, the surrogate key of the date dimension (DateKey) has a meaningful value. This is one of the rare exceptions: we keep the surrogate key of this dimension as an integer type, but with the YYYYMMDD format so that it carries a meaning as well.
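As a sketch of how such a smart key could be generated (the column list and date range are assumptions for this example), a simple population loop in T-SQL could look like the following:

```sql
-- A sketch of populating DimDate with a YYYYMMDD integer surrogate key
DECLARE @d DATE = '2010-01-01';
WHILE @d <= '2019-12-31'
BEGIN
    INSERT INTO DimDate (DateKey, FullDate, [Year], [Quarter], [Month], MonthName)
    VALUES (
        CONVERT(INT, CONVERT(CHAR(8), @d, 112)),  -- e.g. 2014-03-21 -> 20140321
        @d,
        YEAR(@d),
        DATEPART(QUARTER, @d),
        MONTH(@d),
        DATENAME(MONTH, @d)
    );
    SET @d = DATEADD(DAY, 1, @d);
END;
```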
In this example, if we store time information, where do you think would be the place for time attributes? Inside the date dimension? Definitely not. The date dimension stores one record per day, so a date dimension will have 365 records per year and 3,650 records for 10 years. Now, if we add time splits down to the minute, we would require 24*60 records per day. So, the combination of date and time for 10 years would have 3650*24*60 = 5,256,000 records. More than 5 million records for a single dimension is too much; dimensions are usually narrow and only occasionally have more than one million records. So in this case, the best practice is to add another dimension, DimTime, and put all time-related attributes in that dimension. The following screenshot shows some example records and attributes of DimTime:
Usually, the date and time dimensions are generic and static, so you won't be required to populate these dimensions through ETL every night; you just load them once and then use them. I've written two general-purpose scripts on my blog to create and populate the date and time dimensions that you can use. For the date dimension, visit http://www.rad.pasfu.com/index.php?/archives/95-Script-for-Creating-and-Generating-members-for-Date-Dimensions-General-Purpose.html, and for the time dimension, visit http://www.rad.pasfu.com/index.php?/archives/122-Script-for-Creating-and-Generating-members-for-Time-Dimension.html.
The product dimension will have a ProductKey, which is the surrogate key, and a business key, which is the primary key of the product in the source system (something similar to a product's unique number). The product dimension will also have information about the product categories. Again, denormalization occurs in this dimension: the product subcategory and category are placed into the product dimension with redundant values. This decision is made in order to avoid snowflaking and to raise the performance of the join between the fact and the dimensions.
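A minimal sketch of this denormalized product dimension (the attribute names and data types are assumptions) could look like this:

```sql
-- A sketch of DimProduct with subcategory and category denormalized into the
-- same table instead of being snowflaked out into separate tables
CREATE TABLE DimProduct (
    ProductKey          INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key
    ProductAlternateKey NVARCHAR(20) NOT NULL,         -- business key (product number)
    ProductName         NVARCHAR(100),
    Subcategory         NVARCHAR(50),                  -- redundant on purpose
    Category            NVARCHAR(50)                   -- redundant on purpose
);
```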
We are not going to go through the attributes of the store dimension in detail. The most important aspect of this dimension is that it can have a relationship to the date dimension. For example, a store's opening date will be a key related to the date dimension. This type of snowflaking is unavoidable because you cannot copy all of the date dimension's attributes into every other dimension that relates to it; moreover, the date dimension is in use with many other dimensions and facts, so it is better to have a conformed date dimension. Outrigger is Kimball terminology for a conformed dimension, such as date, that is referenced by another dimension through a single-layer, many-to-one relationship.
In the previous example, you learned about the transactional fact. A transactional Fact table has one record per transaction. This type of Fact table usually has the most detailed Grain.
There is also another type of fact, which is the snapshot Fact table. In snapshot fact, each record will be an aggregation of some transactional records for a snapshot period of time. For example, consider financial periods; you can create a snapshot Fact table with one record for each financial period, and the details of the transactions will be aggregated into that record.
Transactional facts are a good source for detailed and atomic reports. They are also good for aggregations and dashboards. The Snapshot Fact tables provide a very fast response for dashboards and aggregated queries, but they don't cover detailed transactional records. Based on your requirement analysis, you can create both kinds of facts or only one of them.
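As an illustration, a monthly snapshot could be loaded from the transactional fact with an aggregation query similar to the following sketch; FactSalesSnapshot and MonthKey are hypothetical names used only for this example:

```sql
-- A sketch of loading a monthly snapshot fact from the transactional fact
INSERT INTO FactSalesSnapshot (MonthKey, StoreKey, ProductKey, SalesAmount, QuantitySold)
SELECT
    d.[Year] * 100 + d.[Month] AS MonthKey,   -- e.g. 201403 for March 2014
    f.StoreKey,
    f.ProductKey,
    SUM(f.SalesAmount)  AS SalesAmount,
    SUM(f.QuantitySold) AS QuantitySold
FROM FactSales AS f
INNER JOIN DimDate AS d ON f.DateKey = d.DateKey
GROUP BY d.[Year] * 100 + d.[Month], f.StoreKey, f.ProductKey;
```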
There is also another type of Fact table called the accumulating Fact table. This Fact table is useful for storing processes and activities, such as order management. You can read more about different types of Fact tables in The Data Warehouse Toolkit, Ralph Kimball, Wiley (which was referenced earlier in this chapter).
The Factless Fact table – The Bridge table
We've explained that Fact tables usually contain FKs of dimensions and some measures. However, there are times when you would require a Fact table without any measure. These types of Fact tables are usually used to show the non-existence of a fact.
For example, assume that the sales business process runs promotions as well, and you have a promotion dimension. Each entry in the Fact table then shows that a customer X purchased a product Y at a date Z from a store S while the promotion P was on (such as the new year's sale). This Fact table covers every requirement that queries information about the sales that happened, or in other words, the transactions that happened. However, there are times when the promotion is on but no transaction happens! This is a valuable analytical report for decision makers because it lets them investigate what was wrong with a promotion that did not generate sales.
So, this is an example of a requirement that the existing Fact table with the sales amount and other measures doesn't fulfill. We would need a Fact table that shows that store S ran the promotion P on the date D for product X. This Fact table doesn't have any fact or measure; it just has FKs to dimensions. However, it is very informative because it tells us on which dates there was a promotion at specific stores on specific products. We call this Fact table a Factless Fact table or Bridge table.
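A minimal sketch of such a bridge table, together with a query that finds promotions with no matching sales, could look like the following; FactPromotionCoverage is a hypothetical name, and the query matches on date, store, and product only:

```sql
-- A sketch of the factless (bridge) Fact table: only foreign keys, no measures
CREATE TABLE FactPromotionCoverage (
    DateKey      INT NOT NULL,
    StoreKey     INT NOT NULL,
    ProductKey   INT NOT NULL,
    PromotionKey INT NOT NULL
);

-- Promotions that were on but produced no sales: rows in the coverage table
-- with no matching row in FactSales
SELECT c.DateKey, c.StoreKey, c.ProductKey, c.PromotionKey
FROM FactPromotionCoverage AS c
WHERE NOT EXISTS (
    SELECT 1
    FROM FactSales AS f
    WHERE f.DateKey = c.DateKey
      AND f.StoreKey = c.StoreKey
      AND f.ProductKey = c.ProductKey
);
```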
Using examples, we've explored the usual dimensions such as customer and date. When a dimension participates in more than one business process and deals with different data marts (such as date), then it will be called a conformed dimension.
Sometimes, a dimension is required in the Fact table more than once. For example, in the FactSales table, you may want to store the order date, shipping date, and transaction date. All three of these columns will point to the date dimension. In this situation, we won't create three separate dimensions; instead, we will reuse the existing DimDate three times under three different names. The date dimension literally plays the role of more than one dimension, which is why we call such dimensions role-playing dimensions.
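A query sketch of this pattern could look like the following; the OrderDateKey, ShipDateKey, and TransactionDateKey column names are illustrative, and DimDate is simply joined once per role under a different alias:

```sql
-- A sketch of a role-playing date dimension: three date keys in the fact,
-- each resolved against the same DimDate table under a different alias
SELECT
    od.FullDate AS OrderDate,
    sd.FullDate AS ShipDate,
    td.FullDate AS TransactionDate,
    f.SalesAmount
FROM FactSales AS f
INNER JOIN DimDate AS od ON f.OrderDateKey       = od.DateKey
INNER JOIN DimDate AS sd ON f.ShipDateKey        = sd.DateKey
INNER JOIN DimDate AS td ON f.TransactionDateKey = td.DateKey;
```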
There are other types of dimensions with some differences, such as the junk dimension and the degenerate dimension. The junk dimension is used for dimensions with very few member values (records) that are used in only one data mart (not conformed). For example, status dimensions are good candidates for a junk dimension. If you create a status dimension for each situation in each data mart, then you will probably have more than ten status dimensions with fewer than five records in each. The junk dimension is a solution that combines such narrow dimensions into one bigger dimension.
You may or may not use a junk dimension in your data mart; using junk dimensions reduces readability, while not using them increases the number of narrow dimensions. So, the choice is based on the requirement analysis phase and the dimensional modeling of the star schema.
A degenerate dimension is another type of dimension that is not a separate dimension table. In other words, a degenerate dimension doesn't have a table; it sits directly inside the Fact table. Assume that you want to store the transaction number (a string value). Where do you think would be the best place for that information? You might think of creating another dimension, entering the transaction number there, assigning a surrogate key, and using that surrogate key in the Fact table. This is not an ideal solution because that dimension would have exactly the same Grain as your Fact table, which means the number of records in your sales transaction dimension would equal that of the Fact table, leaving you with a very deep dimension table, which is not recommended. On the other hand, you cannot think of another attribute for that dimension because all attributes related to the sales transaction already exist in the other dimensions connected to the fact. So, instead of creating a dimension with the same Grain as the fact and with only one column, we leave that column (even though it is a string) inside the Fact table. This type of dimension is called a degenerate dimension.
Slowly Changing Dimension
Now that you understand dimensions, it is a good time to go into more detail about one of the most challenging concepts of data warehousing, which is the slowly changing dimension (SCD). A dimension's attribute values may change, and depending on the requirement, you will take different actions in response to that change. As changes in a dimension's attribute values happen only occasionally, this is called the slowly changing dimension. SCD is split into different types depending on the action to be taken when a change occurs. In this section, we only discuss types 0, 1, and 2.
Type 0 doesn't accept any changes. Let's assume that the Employee Number is inside the Employee dimension. Employee Number is the business key, and it is an important attribute for ETL because ETL distinguishes new employees from existing employees based on this field. So we don't accept any changes in this attribute. This means that SCD type 0 is applied to this attribute.
Sometimes, a value such as the first name is typed wrongly in the source system, and it is likely that someone will come and fix it with a change. In such cases, we accept the change, and we don't need to keep the historical information (the previous name); we simply replace the existing value with the new value. This type of SCD is called type 1. The following screenshot shows how type 1 works:
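In T-SQL terms, a type 1 change is a simple in-place update matched on the business key; the following is a minimal sketch with hypothetical values:

```sql
-- A sketch of an SCD type 1 change: overwrite the attribute, matched on the
-- business key; no history is kept
UPDATE DimCustomer
SET FirstName = 'John'                    -- corrected value
WHERE CustomerAlternateKey = 'CUST-1001'; -- hypothetical business key value
```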
Sometimes, however, the requirement is to maintain historical changes. For example, consider this situation: a customer recently changed their city from Seattle to Charlotte. You cannot use type 0 because it is perfectly possible for someone to change their city of residence. If you behave like type 1 and update the existing record, then you lose the information about the customer's purchases made while they were in Seattle, and all entries will show them as a customer from Charlotte. The requirement to keep historical versions resulted in the third type of SCD, which is type 2.
Type 2 is about maintaining historical changes. The way to keep historical changes is through a couple of metadata columns: FromDate and ToDate. Each new customer is imported into DimCustomer with FromDate set to the start date, and ToDate left as null (or a large default value such as 29990101). If a change happens in the city, the ToDate of the existing record is set to the date of change, and a new record is created as an exact copy of the previous record but with the new city, a FromDate equal to the date of change, and ToDate left as null. With this solution, to find the latest and most up-to-date member information, you just look for the member record whose ToDate is null. To fetch historical information, you search for the record whose FromDate-ToDate span covers the specified point in time. The following screenshot shows an example of SCD type 2:
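A minimal T-SQL sketch of a type 2 change (assuming DimCustomer has the FromDate and ToDate columns described above, and using hypothetical key and date values) could look like this:

```sql
-- A sketch of an SCD type 2 change for a customer moving from Seattle to
-- Charlotte: close the current record, then insert a new version of it
DECLARE @ChangeDate DATE = '2014-03-21';          -- hypothetical date of change
DECLARE @BusinessKey NVARCHAR(20) = 'CUST-1001';  -- hypothetical business key

-- 1. Close the currently active record
UPDATE DimCustomer
SET ToDate = @ChangeDate
WHERE CustomerAlternateKey = @BusinessKey
  AND ToDate IS NULL;

-- 2. Insert a copy of the record with the new city and an open ToDate
INSERT INTO DimCustomer (CustomerAlternateKey, FirstName, LastName, City, FromDate, ToDate)
SELECT CustomerAlternateKey, FirstName, LastName, 'Charlotte', @ChangeDate, NULL
FROM DimCustomer
WHERE CustomerAlternateKey = @BusinessKey
  AND ToDate = @ChangeDate;   -- the record we just closed
```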
There are other types of SCD that are based on combinations of the first three types and cover other kinds of requirements. You can read more about the different types of SCD and methods of implementing them in The Data Warehouse Toolkit referenced earlier in this chapter.