Data modeling for Analysis Services
If you are reading this book, it means you are using Analysis Services and so you will need to design your data marts with specific features of Analysis Services in mind. This does not mean you should completely ignore the basic theory of data warehouse design and dimensional modeling but, instead, adapt the theory to the practical needs of the product you are going to use as the main interface for querying the data.
For this reason, we are going to present a summary of the theory and discuss how the theoretical design of the data warehouse is impacted by the adoption of Analysis Services.
Fact tables and dimension tables
At the core of the data mart structure is the separation of the entire database into two distinct types of entity:
Dimension: a dimension is the major analytical object in the BI space. A dimension can be a list of products or customers, time, geography or any other entity used to analyze numeric data. Dimensions are stored in dimension tables.
Dimensions have attributes. An attribute of a product may be its color, its manufacturer or its weight. An attribute of a date may be its weekday or its month.
Dimensions have both natural and surrogate keys. The natural key is the original product code, customer id or real date. The surrogate key is a new integer number used in the data mart as a key that joins fact tables to dimension tables.
Dimensions have relationships with facts. Their reason for being is to add qualitative information to the numeric information contained in the facts. Sometimes a dimension might have a relationship with other dimensions but directly or indirectly it will always be related to facts in some way.
Fact: a fact is something that has happened or has been measured. A fact may be the sale of a single product to a single customer or the total amount of sales of a specific item during a month. From our point of view, a fact is a numeric value that users would like to aggregate in different ways for reporting and analysis purposes. Facts are stored in fact tables.
We normally relate a fact table to several dimension tables, but we do not relate fact tables directly with other fact tables.
Facts and dimensions are related via surrogate keys. This is one of the foundations of Kimball's methodology.
When we build an Analysis Services solution, we build Analysis Services dimension objects from the dimension tables in our data mart and cubes on top of the fact tables. The concepts of facts and dimensions are so deeply ingrained in the architecture of Analysis Services that we are effectively obliged to follow dimensional modeling methodology if we want to use Analysis Services at all.
Star schemas and snowflake schemas
When we define dimension tables and fact tables and create joins between them, we end up with a star schema. At the center of a star schema there is always a fact table. As the fact table is directly related to dimension tables, if we place these dimensions around the fact table we get something resembling a star shape.
In the diagram above we can see that there is one fact table, FactInternetSales, and four dimension tables directly related to the fact table. Looking at this diagram, we can easily understand that a Customer buys a Product with a specific Currency and that the sale takes place in a specific Sales Territory. Star schemas have the useful characteristic that they are easily understandable by anybody at first glance.
Moreover, this simplicity is not only welcome for human readers; it also helps Analysis Services understand and use star schemas. If we use star schemas, Analysis Services will find it easier to recognize the overall structure of our dimensional model and help us in the cube design process. On the other hand, snowflakes are harder both for humans and for Analysis Services to understand, and we're much more likely to find that we make mistakes during cube design – or that Analysis Services makes incorrect assumptions when setting properties automatically – the more complex the schema becomes.
Nevertheless, it is not always easy to generate star schemas: sometimes necessity (or inexperience) leads us to create a more complex schema that resembles that of a traditional, normalized relational model. Look at the same data mart when we add the Geography dimension:
This is known as a "snowflake" schema. If you imagine more tables like DimGeography appearing in the diagram you will see that the structure resembles a snowflake more than the previous star.
The snowflake schema is nothing but a star schema complicated by the presence of intermediate tables and joins between dimensions. The problem with snowflakes is that reading them at first glance is not so easy. Try to answer these two simple questions: is DimGeography used to build a separate Geography dimension, and is DimSalesTerritory the territory of the customer or the territory of the order?
The answers in this case are:
DimGeography is not used to create a new dimension, but is being used to add geographic attributes to the Customer dimension.
DimSalesTerritory is not the territory of the customer but the territory of the order, added during the ETL phase.
The problem is that, in order to answer these questions, we would have to search through the documentation or the ETL code to discover the exact meaning of the fields.
So the simplicity of the star schema is lost when we switch to a snowflake schema. Nevertheless, snowflakes are sometimes necessary, and when one starts to appear in our project it is very important that we explain how to read the relationships and what the fields mean.
It might be the case that a snowflake design is mandatory, due to the overall structure of the data warehouse or to the complexity of the database structure. In this case, we have basically these options:
We can use views to transform the underlying snowflake into a star schema. Using views to join tables, it's possible to hide the snowflake structure, persist our knowledge of how the tables in the snowflake should be joined together, and present to Analysis Services a pure star schema (a sketch of such a view is shown below). This is, in our opinion, the best approach.
We can use Analysis Services to model joins inside the Data Source View of the project using Named Queries. By doing this, we are relying on Analysis Services to query the database efficiently and recreate the star schema. Although this approach might seem almost equivalent to the use of views in the relational database, in our opinion there are some very good reasons to use views instead of the Data Source View. We discuss these in the section later on in this chapter called Views versus the Data Source View.
We can build Analysis Services dimensions from a set of snowflaked tables. This can have some benefits since it makes it easier for the Dimension Wizard to set up optimal attribute relationships within the dimension, but on the other hand as we've already noted it means we have to remember which columns join to each other every time we build a dimension from these tables. It's very easy to make mistakes when working with complex snowflakes, and to get the error message "the '[tablename]' table that is required for a join cannot be reached based on the relationships in the Data Source View" when we try to process the dimension.
We can leave the snowflake in place and create one Analysis Services dimension for each table, and then use referenced relationships to link these dimensions back to the fact table. Even if this solution seems an interesting one, in our opinion it is the worst.
First of all, the presence of reference dimensions may lead, as we will discuss later, to performance problems either during cube processing or during querying. Additionally, having two separate dimensions in the cube does not give us any benefits in the overall design and may make it less user friendly. The only case where this approach could be advisable is when the dimension is a very complex one: in this case it might be useful to model it once and use reference dimensions where needed. There are some other situations where reference dimensions are useful but they are rarely encountered.
In all cases, the rule of thumb we recommend is the same: keep it simple! This way we'll make fewer mistakes and find it easier to understand our design.
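To make the first of these options more concrete, here is a minimal sketch of a view that hides the Customer/Geography snowflake and presents a single, denormalized Customer dimension table to Analysis Services. The table and column names are illustrative and should be adapted to your own data mart.

-- A view that flattens the DimCustomer/DimGeography snowflake into a single
-- logical dimension table, so that Analysis Services sees a pure star schema.
CREATE VIEW dbo.Customer
AS
SELECT
    c.CustomerKey,          -- surrogate key referenced by the fact table
    c.CustomerAlternateKey, -- natural key
    c.FirstName,
    c.LastName,
    g.City,                 -- geography attributes denormalized into Customer
    g.StateProvinceName,
    g.CountryRegionName
FROM dbo.DimCustomer AS c
LEFT JOIN dbo.DimGeography AS g
    ON c.GeographyKey = g.GeographyKey;  -- the join is defined once, here

The knowledge of how DimCustomer and DimGeography join together is now persisted in one place, and the cube can be built as if the schema had always been a star.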
Junk dimensions
At the end of the dimensional modeling process, we often end up with some attributes that do not belong to any specific dimension. Normally these attributes have a very limited range of values (perhaps three or four values each, sometimes more) and they do not seem important enough to be considered dimensions in their own right, although obviously we couldn't just drop them from the model altogether.
We have two choices:
Create a very simple dimension for each of these attributes. This will lead to rapid growth in the number of dimensions in the solution, something the users will not like because it makes the cube harder to use.
Merge all these attributes into a so-called "junk dimension" (a relational sketch of this approach follows this discussion). A junk dimension is simply a dimension that merges together attributes that do not belong anywhere else and share the characteristic of having only a few distinct values each.
The main reasons for the use of a junk dimension are:
If we join several small dimensions into a single junk dimension, we will reduce the number of fields in the fact table. For a fact table of several million rows this can represent a significant reduction in the amount of space used and the time needed for cube processing.
Reducing the number of dimensions will mean Analysis Services performs better during the aggregation design process and during querying, thereby improving the end user experience.
The end user will never like a cube with 30 or more dimensions: it will be difficult to use and to navigate. Reducing the number of dimensions will make the cube less intimidating.
However, there is one big disadvantage in using a junk dimension: whenever we join attributes together into a junk dimension, we are clearly stating that these attributes will never have the rank of a fully-fledged dimension. If we ever change our mind and need to break one of these attributes out into a dimension on its own we will not only have to change the cube design, but we will also have to reprocess the entire cube and run the risk that any queries and reports the users have already created will become invalid.
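As an illustration of the junk dimension approach described above, the following sketch builds a junk dimension from the distinct combinations of a few low-cardinality attributes. All table and column names here are hypothetical.

-- Build a junk dimension from the distinct combinations of three
-- low-cardinality attributes found in the staging sales data.
CREATE TABLE dbo.DimSalesJunk (
    SalesJunkKey INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    OrderChannel NVARCHAR(20),    -- e.g. 'Internet', 'Reseller'
    PaymentType  NVARCHAR(20),    -- e.g. 'Cash', 'Credit Card'
    GiftWrapFlag BIT
);

INSERT INTO dbo.DimSalesJunk (OrderChannel, PaymentType, GiftWrapFlag)
SELECT DISTINCT OrderChannel, PaymentType, GiftWrapFlag
FROM staging.Sales;   -- only combinations that actually occur are loaded

The fact table then carries a single SalesJunkKey column instead of three separate keys.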
Degenerate dimensions
Degenerate dimensions are created when we have columns on the fact table that we want to use for analysis but which do not relate to any existing dimension. Degenerate dimensions often have almost the same cardinality as the fact table; a typical example is the transaction number for a point of sale data mart. The transaction number may be useful for several reasons, for example to calculate a "total sold in one transaction" measure.
Moreover, it might be useful if we need to go back to the OLTP database to gather other information. However, even though it is an often-requested feature, users should not be allowed to navigate sales data using a transaction number, because the resulting queries are likely to bring back enormous amounts of data and run very slowly. Instead, if the transaction number is ever needed, it should be displayed in a specifically-designed report that shows the contents of a small number of transactions.
Keep in mind that, even though the literature often discusses degenerate dimensions as separate entities, it is often the case that a big dimension might have some standard attributes and some degenerate ones. In the case of the transaction number, we might have a dimension holding both the transaction number and the Point Of Sale (POS) number. The two attributes live in the same dimension but one is degenerate (the transaction number) and one is a standard one (the POS number). Users might be interested in slicing sales by POS number and they would expect good performance when they did so; however, they should not be encouraged to slice by transaction number due to the cardinality of the attribute.
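As a sketch of why the transaction number is still worth storing even though it is not exposed for navigation, the following query (with assumed table and column names) computes an average amount sold per transaction at the relational level:

-- The degenerate TransactionNumber column is used to derive a measure,
-- without the attribute ever being made browsable to end users.
SELECT
    f.PosNumber,
    SUM(f.SalesAmount) / COUNT(DISTINCT f.TransactionNumber)
        AS AvgAmountPerTransaction
FROM dbo.FactSales AS f
GROUP BY f.PosNumber;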
From an Analysis Services point of view, degenerate dimensions are no different to any other dimension. The only area to pay attention to is the design of the attributes: degenerate attributes should not be made query-able to the end user (you can do this by setting the attribute's AttributeHierarchyEnabled
property to False) for the reasons already mentioned. Also, for degenerate dimensions that are built exclusively from a fact table, Analysis Services has a specific type of dimension relationship type called Fact. Using the Fact relationship type will lead to some optimizations being made to the SQL generated if ROLAP storage is used for the dimension.
Slowly Changing Dimensions
Dimensions change over time. A customer changes his/her address, a product may change its price or other characteristics and – in general – any attribute of a dimension might change its value. Some of these changes are useful to track while some of them are not; working out which changes should be tracked and which shouldn't can be quite difficult though.
Changes should not happen very often. If they do, then we might be better off splitting the attribute off into a separate dimension. If changes happen rarely, then the set of techniques known as Slowly Changing Dimensions (SCDs) is the solution, and we need to model this behavior into our dimensions.
SCDs come in three flavors:
Type 1: We maintain only the last value of each attribute in the dimension table. If a customer changes address, then the previous one is lost and all the previous facts will be shown as if the customer always lived at the same address.
Type 2: We create a new record in the dimension table whenever a change happens. All previous facts will still be linked to the old record. Thus, in our customer address example the old facts will be linked to the old address and the new facts will be linked to the new address.
Type 3: If what we want is simply to know the "last old value" of a specific attribute of a dimension, we can add a field to the dimension table in order to save just the "last value of the attribute" before updating it. In the real world, this type of dimension is used very rarely.
The SCD type used is almost never the same across all the dimensions in a project. We will normally end up with several dimensions of type 1 and occasionally with a couple of dimensions of type 2. Also, not all the attributes of a dimension have to have the same SCD behavior. History is not usually stored for the date of birth of a customer, if it changes, since the chances are that the previous value was a mistake. On the other hand, it's likely we'll want to track any changes to the address of the same customer. Finally, there may be the need to use the same dimension with different slowly changing types in different cubes. Handling these changes will inevitably make our ETL more complex.
Type 1 dimensions are relatively easy to handle and to manage: each time we detect a change, we apply it to the dimension table in the data mart and that is all the work we need to do.
Type 2 dimensions are more complex: when we detect a change, we invalidate the old record by setting its "end of validity date" and insert a new record with the new values. As all the new data will refer to the new record, it is simple to use in queries. We should have only one valid record for each entity in the dimension.
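A minimal sketch of this technique in the ETL, assuming the dimension table has ValidFrom, ValidTo and IsCurrent columns and that the variables have already been populated from the source row:

-- Close the currently valid record for this customer if the tracked
-- attribute has changed...
UPDATE dbo.DimCustomer
SET    ValidTo = @LoadDate,
       IsCurrent = 0
WHERE  CustomerAlternateKey = @CustomerCode
  AND  IsCurrent = 1
  AND  Address <> @NewAddress;

-- ...and insert a new record that becomes the valid one from now on.
IF @@ROWCOUNT > 0
    INSERT INTO dbo.DimCustomer
        (CustomerAlternateKey, Address, ValidFrom, ValidTo, IsCurrent)
    VALUES
        (@CustomerCode, @NewAddress, @LoadDate, NULL, 1);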
The modeling of SCDs in Analysis Services will be covered later but, in this theoretical discussion, it might be interesting to spend some time on the different ways to model Type 2 SCDs in the relational data mart.
A single dimension will hold attributes with different SCD types since not all the attributes of a single dimension will need to have historical tracking. So, we will end up with dimensions with some Type 1 attributes and some Type 2 attributes. How do we model that in the data mart?
We basically have two choices:
We can build two dimensions: one containing the Type 2 attributes and one containing the Type 1 attributes. Obviously, we will need two different dimension tables in the data mart to do this.
This solution is very popular and is easy to design but has some serious drawbacks:
The number of dimensions in the cube is much larger. If we do this several times, the number of dimensions might reach the point where we have usability problems.
If we need to run queries that include both Type 1 and Type 2 attributes, Analysis Services has to resolve the relationship between the two dimensions via the fact table and, for very big fact tables, this might be very time-consuming. This issue is not marginal because, if we give users both types of attribute, they will always want to use them together in queries.
We can build a complex dimension holding both the Type 1 and Type 2 values in a single dimension table. This solution will lead to much more complex ETL to build the dimension table but solves the drawbacks of the previous solution. For example, having both the Type 1 and Type 2 attributes in a single dimension can lead to better query performance when comparing values for different attributes, because the query can be resolved at the dimension level and does not need to cross the fact table. Also, as we've stressed several times already, having fewer dimensions in the cube makes it much more user-friendly.
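A sketch of what this single dimension table might look like, with Type 1 and Type 2 attributes side by side (column names are illustrative):

-- One dimension table holding both Type 1 and Type 2 attributes.
CREATE TABLE dbo.DimCustomer (
    CustomerKey          INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key, one per version
    CustomerAlternateKey NVARCHAR(20) NOT NULL,         -- natural key, one per customer
    DateOfBirth          DATE,             -- Type 1: overwritten in place when corrected
    Address              NVARCHAR(100),    -- Type 2: a change creates a new record
    City                 NVARCHAR(50),     -- Type 2
    ValidFrom            DATE NOT NULL,
    ValidTo              DATE NULL,        -- NULL marks the currently valid record
    IsCurrent            BIT NOT NULL
);
-- A Type 1 change updates every version of the customer; a Type 2 change
-- closes the current record and inserts a new one, as sketched earlier.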
Bridge tables, or factless fact tables
We can use the terms bridge table and factless fact table interchangeably: they both refer to the same thing, a table that is used to model a many-to-many relationship between two dimension tables. Since the name factless fact table can be misleading, we prefer the term bridge table, even though the literature often refers to these tables as factless fact tables.
Note
All fact tables represent many-to-many relationships between dimensions but, for bridge tables, this relationship is their only reason to exist: they do not contain any numeric columns – facts – that can be aggregated (hence the use of the name 'factless fact table'). Regular fact tables generate many-to-many relationships as a side effect, as their reason for being is the nature of the fact, not of the relationship.
Now, let us see an example of a bridge table. Consider the following situation in an OLTP database:
In any given period of time, a product can be sold on special offer. The bridge table (SpecialOfferProduct) tells us which products were on special offer at which times, while the SpecialOffer table tells us information about the special offer itself: when it started, when it finished, the amount of discount and so on.
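A relational sketch of the bridge table just described, with simplified keys and names:

-- The bridge table holds only the keys of the two dimensions it relates;
-- there are no numeric columns to aggregate.
CREATE TABLE dbo.SpecialOfferProduct (
    SpecialOfferKey INT NOT NULL
        REFERENCES dbo.DimSpecialOffer (SpecialOfferKey),
    ProductKey      INT NOT NULL
        REFERENCES dbo.DimProduct (ProductKey),
    PRIMARY KEY (SpecialOfferKey, ProductKey)
);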
A common way of handling this situation is to denormalize the special offer information into a dimension directly linked to the fact table, so we can easily see whether a specific sale was made under special offer or not. In this way, we can use the fact table to hold both the facts and the bridge. Nevertheless, bridge tables offer a lot of benefits and, in situations like this, they are definitely the best option. Let's take a look at the reasons why.
It is interesting to consider whether we can represent the relationship in the example above using only fact tables (that is, storing three types of data for each sale: product, sale and special offer) or whether a bridge table is necessary. While the first option is certainly correct, we need to think carefully before using it because, if we do, all data on special offers that did not generate any sales will be lost. If a specific special offer results in no product sales, then we aren't storing the relationship between the special offer and the product anywhere; it will be exactly as though the product had never been on special offer. This is because the fact table does not contain any data that defines the relationship between the special offers and the products; it only knows about this relationship when a sale is made. This situation may lead to confusion or incorrect reports. We always need to remember that the absence of a fact may be as important as its presence; indeed, sometimes it is more important.
We recommend using bridge tables to model many-to-many relationships that do not strictly depend on facts to define the relationship. The relationships modeled by many-to-many relationships are often not bound to any fact table and exist regardless of any fact table. This shows the real power of bridge tables but, as always, the more power we have the bigger our responsibilities will be, and bridge tables will sometimes cause us headaches.
Bridge tables are modeled in Analysis Services as measure groups that act as bridges between different dimensions, through the many-to-many dimension relationship type, one of the most powerful features of Analysis Services. This feature will be analyzed in greater detail in Chapter 6.
Snapshot and transaction fact tables
Now that we have defined what a fact table is, let us go deeper and look at the two main types: transaction fact tables and snapshots.
A transaction fact table records events and, for each event, certain measurements or values are recorded. When we record a sale, for example, we create a new row in the transaction fact table that contains information relating to the sale such as what product was sold, when the sale took place, what the value of the sale was, and so on.
A snapshot fact table records the state of something at different points in time. If we record in a fact table the total sales for each product every month, we are not recording an event but a specific situation. Snapshots can also be useful when we want to measure something not directly related to any other fact. If we want to rank our customers based on sales or payments, for example, we may want to store snapshots of this data in order to analyze how these rankings change over time in response to marketing campaigns.
Using a snapshot table containing aggregated data instead of a transaction table can drastically reduce the number of rows in our fact table, which in turn leads to smaller cubes, faster cube processing and faster querying. The price we pay for this is the loss of any information that can only be stored at the transaction level and cannot be aggregated up into the snapshot, such as the transaction number data we encountered when discussing degenerate dimensions. Whether this is an acceptable price to pay is a question only the end users can answer.
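As a sketch of how a monthly snapshot might be derived from a transaction-level fact table (all names here are assumptions), the load is a simple aggregation:

-- Building a monthly snapshot from the transaction-level fact table.
-- Transaction-level detail, such as the transaction number, is lost here.
INSERT INTO dbo.FactSalesMonthlySnapshot
    (ProductKey, MonthKey, SalesAmount, SalesQuantity)
SELECT
    f.ProductKey,
    d.MonthKey,               -- e.g. 200801 for January 2008
    SUM(f.SalesAmount),
    SUM(f.SalesQuantity)
FROM dbo.FactSalesTransactions AS f
JOIN dbo.DimDate AS d
    ON f.OrderDateKey = d.DateKey
GROUP BY f.ProductKey, d.MonthKey;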
Updating fact and dimension tables
In an ideal world, data that is stored in the data warehouse would never change. Some books suggest that we should only support insert operations in a data warehouse, not updates: data comes from the OLTP, is cleaned and is then stored in the data warehouse until the end of time, and should never change because it represents the situation at the time of insertion.
Nevertheless, the real world is somewhat different to the ideal one. While some updates are handled by the slowly changing dimension techniques already discussed, there are other kinds of updates needed in the life of a data warehouse. In our experience, these other types of update in the data warehouse are needed fairly regularly and are of two main kinds:
Structural updates: when the data warehouse is up and running, we will need to perform updates to add information like new measures or new dimension attributes. This is normal in the lifecycle of a BI solution.
Data updates: we need to update data that has already been loaded into the data warehouse, because it is wrong. We need to delete the old data and enter the new data, as the old data will inevitably lead to confusion. There are many reasons why bad data comes to the data warehouse; the sad reality is that bad data happens and we need to manage it gracefully.
Now, how do these kinds of updates interact with fact and dimension tables? Let's summarize briefly what the physical distinctions between fact and dimension tables are:
Dimension tables are normally small, usually with fewer than one million rows and very frequently far fewer than that.
Fact tables are often very large; they can have up to hundreds of millions or even billions of rows. Fact tables may be partitioned, and loading data into them is usually the most time-consuming operation in the whole of the data warehouse.
Structural updates on dimension tables are very easy to make. You simply update the table with the new metadata, make the necessary changes to your ETL procedures and the next time they are run the dimension will reflect the new values. If your users decide that they want to analyze data based on a new attribute on, say, the customer dimension, then the new attribute can be added for all of the customers in the dimension. Moreover, if the attribute is not present for some customers, then they can be assigned a default value; after all, updating one million rows is not a difficult task for SQL Server or any other modern relational database. However, even if updating the relational model is simple, the updates need to go through to Analysis Services and this might result in the need for a full process of the dimension and therefore the cube, which might be very time consuming.
On the other hand, structural updates may be a huge problem on fact tables. The problem is not that of altering the metadata, but determining and assigning a default value for the large number of rows that are already stored in the fact table. It's easy to insert data into fact tables. However, creating a new field with a default value would result in an UPDATE
command that will probably run for hours and might even bring down your database server. Worse, if we do not have a simple default value to assign, then we will need to calculate the new value for each row in the fact table, and so the update operation will take even longer. We have found that it is often better to reload the entire fact table rather than perform an update on it. Of course, in order to reload the fact table, we need to have all of our source data at hand and this is not always possible.
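To illustrate the difference in cost (with illustrative names): altering the metadata is cheap, but populating the new column touches every row of the fact table.

-- Adding the column itself is a quick metadata change...
ALTER TABLE dbo.FactSales ADD DiscountAmount MONEY NULL;

-- ...but populating it rewrites every row of a table that may contain
-- hundreds of millions of rows, and may run for hours:
UPDATE dbo.FactSales SET DiscountAmount = 0;  -- or a per-row calculation

-- Reloading the fact table from source into a new structure is often faster,
-- if the source data is still available.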
Data updates are an even bigger problem still, both on facts and dimensions. Data updates on fact tables suffer from the same problems as adding a new field: often, the number of rows that we need to update is so high that running even simple SQL commands can take a very long time.
Data updates on dimensions can be a problem because they may require very complex logic. Suppose we have a Type 2 SCD and that a record was entered into the dimension table with incorrect attribute values. In this situation, we would have created a new record and linked all the facts received after its creation to the new (and incorrect) record. Recovering from this situation requires us to issue very precise UPDATE
statements to the relational database and to recalculate all the fact table rows that depend – for any reason – on the incorrect record. Bad data in dimensions is not very easy to spot, and sometimes several days – if not months – pass before someone (in the worst case the user) discovers that something went wrong.
There is no foolproof way for stopping bad data getting into the data warehouse. When it happens, we need to be ready to spend a long time trying to recover from the error. It's worth pointing out that data warehouses or data marts that are rebuilt each night ("one shot databases") are not prone to this situation because, if bad data is corrected, the entire data warehouse can be reloaded from scratch and the problem fixed very quickly. This is one of the main advantages of "one shot" data warehouses, although of course they do suffer from several disadvantages too such as their limited ability to hold historic data.
Natural and surrogate keys
In Kimball's view of a data mart, all the natural keys should be represented with a surrogate key that is a simple integer value and that has no meaning at all. This gives us complete freedom in the data mart to add to or redefine a natural key's meaning and, importantly, the usage of the smallest possible integer type for surrogate keys will lead to a smaller fact table.
All this is very good advice. Nevertheless, there are situations in which the rules surrounding the usage of surrogate keys should be relaxed or, to put it another way, there can be times when it's useful to make the surrogate keys meaningful instead of meaningless. Let's consider some of the times when this might be the case:
Date: we can use a meaningless key as a surrogate key for the Date dimension. However, is there any point in doing so? In our opinion, the best representation of a date surrogate key is an integer in the form YYYYMMDD, so 20080109 represents January 9th 2008. Note that even the Kimball Group, writing in the book The Microsoft Data Warehouse Toolkit, accept that this can be a good idea. The main reason for this is that it makes SQL queries that filter by date much easier to write and much more readable – we very often want to partition a measure group by date, for instance. The reason that it's safe to do this is that the date dimension will never change. You might add some attributes to a date dimension table and you might load new data into it, but the data that is already there should never need to be altered.
All invalid dates may be easily represented with negative numbers, so -1 may be the unknown date, -2 may be the empty date and so on. We will have plenty of space for all the dummy dates we will ever need.
A word of warning about the type of the key: we sometimes face situations where the DateTime type has been used for the key of the Date dimension. This is absolutely the wrong thing to do: not only is a DateTime representation bigger than an INT representation, but the DateTime type also does not let us add dummy values to the dimension easily.
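As a sketch, the YYYYMMDD integer key can be derived directly from the source date during the ETL; the source table name is an assumption.

-- Style 112 of CONVERT produces 'YYYYMMDD', which converts cleanly to INT.
SELECT CONVERT(INT, CONVERT(CHAR(8), o.OrderDate, 112)) AS OrderDateKey
FROM dbo.SourceOrders AS o;

-- Dummy members use the same INT type:
--   -1 = unknown date, -2 = empty date, and so on.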
Ranges: Suppose we want a dimension that will rank the sales based on the amount of the specific sale. We want to analyze information based on a range, not on each single amount.
If we define a "Range Amount" dimension and an ID_RangeAmount
key in the fact table, this will solve our modeling problem. However, what will happen when the customer wants to change the ranges? We will have to re-compute the whole fact table because the ID_RangeAmount
key will become useless.
On the other hand, if you decide that $100 will be the granularity of the range dimension, you can use FLOOR(Amount / 100) as ID_RangeAmount and, in doing so, you will be able to update the attributes of the RangeAmount dimension (and any hierarchies built on them) without updating the fact table. The advantages of doing this are discussed in more detail in the following blog entry: http://tinyurl.com/rangedims, and we discuss how to model this in Analysis Services in Chapter 4.
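A sketch of this technique with assumed names: the key is computed from the amount during the fact load, and the dimension simply describes each $100 band.

-- The surrogate key of the Range Amount dimension is derived from the amount,
-- so changing range descriptions or groupings never touches the fact table.
SELECT
    s.SalesAmount,
    CAST(FLOOR(s.SalesAmount / 100) AS INT) AS ID_RangeAmount
FROM staging.Sales AS s;

-- The dimension holds one row per $100 band, for example:
--   ID_RangeAmount = 0  ->  'Less than $100'
--   ID_RangeAmount = 1  ->  '$100 - $199'
-- and these descriptions, or any grouping of them, can be changed freely.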
The surrogate key of a dimension will be surfaced in MDX in the unique names generated for members on the key attribute of the dimension. Since all Analysis Services client tools save their queries using these unique names, this means that if for any reason we change the surrogate key of a member in the future (for example, because we have a "one shot" data warehouse that is reloaded each night), the saved queries will no longer work.
Junk dimensions: junk dimensions, when defined at the conformed dimension level, can cause problems when there is the need to update data. If, for example, we need a new attribute on a junk dimension, this can involve a change in the meaning of the existing members of the junk dimension. This will invalidate the older facts, requiring a careful rebuilding of the junk dimension. In our opinion it is better to maintain the junk dimensions as separate dimensions in the relational data mart and then merge them into a single dimension for Analysis Services, creating a key attribute that uses a composite key made up of the keys of all the dimension tables that make up the junk dimension.
Therefore, the conclusion is that although surrogate keys are very useful and we are not saying that there is something wrong with them, in some well-defined situations it makes sense to deviate from the standard recommendations for surrogate keys and use different forms instead.
The last consideration is that – even in cases where we deviate from the standard – the usage of surrogate keys of the smallest integer type possible is always strongly advised. The Analysis Services engine is optimized for the handling of integer values and does not handle string and date values anywhere near as well.
Unknown members, key errors, and NULLability
When designing a data mart, questions often arise about the relationship between the fact table and the dimensions. Should the foreign keys be NULLable or not? Should we use the built-in foreign keys of SQL Server to handle the relationship? What about key errors?
Since these are very interesting topics, let us discuss them in more detail.
Can we use NULLable foreign keys columns? The answer is definitely no. If we do, when the data is moved into the cube, there will be no relationship between the dimension and the facts containing the NULL key, leading to either processing errors or situations where the user will see partial data. It is much better to add a new member to the dimension, and relate the facts with missing key values to that member in our ETL. Although the Unknown Members feature in Analysis Services does this for us, we will have more flexibility if we handle the issue ourselves.
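A sketch of handling this in the ETL, assuming the surrogate key value -1 is reserved for the unknown member; the table and column names are illustrative.

-- Reserve a member for facts whose dimension key is missing in the source.
-- (If the surrogate key is an IDENTITY column, SET IDENTITY_INSERT is needed.)
INSERT INTO dbo.DimProduct (ProductKey, ProductAlternateKey, ProductName)
VALUES (-1, 'N/A', 'Unknown');

-- During the fact load, map missing or unmatched keys to the unknown member
-- instead of leaving them NULL.
SELECT
    COALESCE(p.ProductKey, -1) AS ProductKey,
    s.SalesAmount
FROM staging.Sales AS s
LEFT JOIN dbo.DimProduct AS p
    ON s.ProductCode = p.ProductAlternateKey;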
Should we use SQL Server FOREIGN KEYS
? The correct technique to adopt here is as follows: we should define the foreign keys in the data model and activate them during debugging, in order to detect errors. When the system is in production, the foreign key can be disabled in order to speed up the ETL code.
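A sketch of this technique in SQL Server, with illustrative table and constraint names:

-- During development: enforce the relationship and validate existing rows,
-- so that key errors surface immediately.
ALTER TABLE dbo.FactSales
    WITH CHECK ADD CONSTRAINT FK_FactSales_DimProduct
    FOREIGN KEY (ProductKey) REFERENCES dbo.DimProduct (ProductKey);

-- In production, once the ETL is trusted: disable the check to speed up loads.
ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT FK_FactSales_DimProduct;

-- The constraint can be re-enabled, and the existing data re-validated, later:
ALTER TABLE dbo.FactSales WITH CHECK CHECK CONSTRAINT FK_FactSales_DimProduct;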
Key errors: there should be no key errors at all. If we enable foreign key checking then we will end up with a table without key errors. This is important: even though we can leverage Analysis Services' key error handling functionality during processing, we advise not using it because the presence of a key error should be detected and resolved during the ETL process, as with NULL keys.