Physical database design for Analysis Services
Apart from the issue of modeling data in an appropriate way for Analysis Services, it's also important to understand how details of the physical implementation of the relational data mart can be significant too.
Multiple data sources
All of the dimension and fact tables you intend to use should exist within the same relational data source, so for example, if you're using SQL Server, this means all the tables involved should exist within the same SQL Server database. If you create multiple data sources within Analysis Services, then you'll find that one is treated as the primary data source; this has to point to an instance of SQL Server (either SQL Server 2000 or above) and all data from other data sources is fetched via the primary data source using the SQL Server OPENROWSET
function, which can lead to severe processing performance problems.
Data types and Analysis Services
When we design data marts, we need to be aware that Analysis Services does not treat all data types the same way. The cube will be much faster, for both processing and querying, if we use the right data type for each column. Here, we provide a brief table that helps us during the design of the data mart, to choose the best data type for each column type:
Fact column type |
Fastest SQL Server data types |
---|---|
Surrogate keys |
|
Date key |
|
Integer measures |
|
Numeric measures |
(Note that decimal and vardecimal require more CPU power to process than money and float types) |
Distinct count columns |
(If your count column is |
Clearly, we should always try to use the smallest data type that will be able to hold any single value within the whole range of values needed by the application.
Note
This is the rule for relational tables. However, you also need to remember that the equivalent measure data type in Analysis Services must be large enough to hold the largest aggregated value of a given measure, not just the largest value present in a single fact table row.
Always remember that there are situations in which the rules must be overridden. If we have a fact table containing 20 billion rows, each composed of 20 bytes and a column that references a date, then it might be better to use a SMALLINT
column for the date if we find a suitable representation that holds all necessary values. We will gain 2 bytes for each row, and that means a 10 percent reduction in the size of the whole table.
SQL queries generated during cube processing
When Analysis Services needs to process a cube or a dimension, it sends queries to the relational database in order to retrieve the information it needs. Not all the queries are simple SELECT
s, there are many situations in which Analysis Services generates complex queries. Even if we do not have space enough to cover all scenarios, we're going to provide some examples relating to SQL Server, and we advise the reader to have a look at the SQL queries generated for their own cube to check whether they can be optimized in some way.
Dimension processing
During dimension processing, Analysis Services sends several queries, one for each attribute of the dimension, in the form of SELECT DISTINCT ColName
, where ColName
is the name of the column holding the attribute.
Many of these queries are run in parallel (exactly which ones can be run in parallel depends on the attribute relationships defined on the Analysis Services dimension), so SQL Server will take advantage of its cache system and perform only one physical read of the table, so that all successive scans are performed from memory. Nevertheless, keep in mind that the task of detecting the DISTINCT
values of the attributes is done by SQL Server, not Analysis Services.
We also need to be aware that, if our dimensions are built from complex views, they might confuse the SQL Server engine which will not be able to perform at its best. If, for example, we add a very complex WHERE
condition to our view, then the condition will be evaluated more than once. We have personally seen a situation where the processing of a simple Time
dimension with only a few hundred rows, which had a very complex WHERE
condition, took tens of minutes to complete. If the views uses to feed a dimension are too complex, this normally means that the ETL phase is not populating the dimensions in the right way and it would be better to review the ETL and produce physical tables which are easier to query during cube processing.
Dimensions with joined tables
If a dimension contains attributes that come from a joined table, the JOIN
is performed by SQL Server, not Analysis Services. This situation arises very frequently when we define snowflakes instead of simpler star schemas. Since some attributes of a dimension are computed by taking their values from another dimension table, Analysis Services will send a query to SQL Server containing the INNER JOIN
between the two tables.
As long as all the joins are made on the primary keys, this will not lead to any problems, but in cases where the JOIN
is not made on the primary key, bad performance might result. As we said before, if we succeed in the goal of exposing a simple star schema to Analysis Services, we will never have to handle these JOIN
s. As we argue next, if a snowflake is really needed, we can still hide it from Analysis Services using views, and in these views we will have full control over and knowledge of the complexity of the query used.
Reference dimensions
Reference dimensions, when present in the cube definition, will lead to one of the most hidden and most dangerous types of JOIN
. When we define the relationship between a dimension and a fact table, we can use the Referenced
relationship type and use an intermediate dimension to relate the dimension to the fact table. Reference dimensions often appear in the design due to snowflakes or due to the need to reduce fact table size.
A referenced dimension may be materialized or not. If we decide to materialize a reference dimension (as SQL Server Data Tools will suggest) the result is that the fact table query will contain a JOIN
to the intermediate dimension, to allow Analysis Services to get the value of the key for the reference dimension.
If JOIN
s are a problem with dimension queries, they are a serious problem with fact queries. It might be the case that SQL Server needs to write a large amount of data to its temporary database before returning information to Analysis Services. It all depends on the size of the intermediate table and the number of reference dimensions that appear in the cube design.
We are not going to say that referenced dimensions should not be used at all, as there are a few cases where reference dimensions are useful, and in the following chapters we will discuss them in detail. Nevertheless, we need to be aware that reference dimensions might create complex queries sent to SQL server and this can cause severe performance problems during cube processing.
Fact dimensions
The processing of dimensions related to measure group with a fact relationship type, usually created to hold degenerate dimensions, is performed in the same way as any other dimension. This means that SELECT DISTINCT
will be issued on all the degenerate dimension's attributes.
Clearly, as the dimension and the fact tables are the same, the query will ask for DISTINCT
over a fact table; given that fact tables can be very large, the query might take a long time to run. Nevertheless, if a degenerate dimension is needed and it is stored in a fact table, then there is no other choice but to pay the price with this query.
Distinct Count measures
The last kind of query that we need to be aware of is when we have a measure group containing a DISTINCT COUNT
measure. In this case, due to the way Analysis Services calculates distinct counts, the query to the fact table will be issued with ORDER BY
for the column we are performing the distinct count on.
Needless to say, this will lead to very poor performance because we are asking SQL Server to sort a fact table on a column that is not part of the clustered index (usually, the clustered index is built on the primary key). The pressure on the temporary database will be tremendous and the query will take a lot of time.
There are some optimizations, mostly pertinent to partitioning, that need to be done when we have DISTINCT COUNT
measures in very big fact tables. What we want to point out is that in this case a good knowledge of the internal behavior of Analysis Services is necessary in order to avoid bad performance when processing.
Indexes in the data mart
The usage of indexes in data mart is a very complex topic and we cannot cover it all in a simple section. Nevertheless, there are a few general rules that can be followed for both fact and dimension tables.
Dimension tables
Dimension tables should have a primary clustered key based on an integer field, which is the surrogate key.
Non-clustered indexes may be added for the natural key, in order to speed up the ETL phase for Slowly Changing Dimensions. The key might be composed of the natural key and the slowly changing dimension date of insertion. These indexes might be defined as UNIQUE
, but like any other constraint in the data mart, the uniqueness should be enforced in development and disabled in production.
Fact tables
It is questionable whether fact tables should have a primary key or not. We prefer to have a primary clustered key based on an integer field, because it makes it very simple to identify a row in the case where we need to check for its value or update it.
In the case where the table is partitioned by date, the primary key will be composed of the date and the integer surrogate key, to be able to meet the needs of partitioning.
If a column is used to create a DISTINCT COUNT
measure in a cube, then it might be useful to have that column in the clustered index, because Analysis Services will request an ORDER BY
clause on that column during the process of the measure group. It is clear that the creation of a clustered index is useful in large cubes where data is added incrementally, so processing will benefit from the ordered data. If on the other hand, we have a one-shot solution where all tables are reloaded from scratch and the cube is fully processed, then it is better to avoid the creation of a clustered index since the sorting of the data is performed only once, during cube processing.
Once the cube has been built, if MOLAP storage is being used, no other indexes are useful. However if the data mart is queried by other tools such as Reporting Services, or if ROLAP partitions are created in Analysis Services, then it might be necessary to add more indexes to the tables. Remember, though, that indexes slow down update and insert operations so they should be added with care. A deep analysis of the queries sent to the relational database will help to determine the best indexes to create.
Usage of schemas
The data warehouse is normally divided into subject areas. The meaning of a subject area really depends on the specific needs of the solution. Typical subject areas include:
Sales
Accounts
Warehouses
Suppliers
Personnel and staff management
Clearly, this list is far from complete and is different for every business. SQL Server provides schemas to arrange tables and—in our experience—the usage of schemas to assign database objects to subject areas leads to a very clear database structure.
Some tables will inevitably have no place at all in any subject area, but we can always define a common subject area to hold all these tables.
Naming conventions
A clear and consistent naming convention is good practice for any kind of relational database and a data mart is no different. As well as making the structure more readable, it will help you when you come to build your cube because SQL Server Data Tools will be able to work out automatically which columns in your dimension and fact tables should join to each other if they have the same names.
Views versus the Data Source View
The Data Source View (DSV, from now on) is one of the places where we can create an interface between Analysis Services and the underlying relational model. In the DSV, we can specify joins between tables and we can create named queries and calculations to provide the equivalent of views and derived columns. It's very convenient for the cube developer to open up the DSV in SQL Server Data Tools and make these kind of changes.
This is all well and good, but nevertheless our opinion about the DSV is clear: it is almost too powerful and, using its features, we risk turning a clean, elegant structure into a mess. It is certainly true that there is the need for an interface between the relational model of the database and the final star schema, but we don't think it's a good idea to use the DSV for this purpose.
SQL Server gives us a much more powerful and easy-to-use tool to use instead: SQL Views. The following is a list of some of the reasons why we prefer to use views instead of the DSV:
Views are stored where we need them: When we need to read the specification of an interface, we want to be able to do it quickly. Views are stored in the database, exactly where we want them to be. If we need to modify the database, we want to be able to find all of the dependencies easily, and using views, we have a very easy way of tracking dependencies.
If we use the DSV, we are hiding these dependencies from the database administrator, the person who needs to be able to update and optimize the data mart. In addition, there are tools on the market that can analyze dependencies between table and views. It is not easy to do this if information on the joins between tables is stored outside the database.
We can easily change column names in views: In the database, we might have
SoldQty
as a field in a table. This is good because it is concise and does not contain useless spaces. In the cube, we want to show it asQuantity Sold
simply because our user wants a more descriptive name.Views are a very useful means of changing names when needed. In turn, with views we are publicly declaring the name change so that everybody will easily understand that a specific field with a name in one level is—in reality—a field that has another name in the previous level.
Clearly, we should avoid the practice of changing names at each level. As always, having the opportunity to do something does not mean that we need to do it.
We can perform simple calculations in views easily: If we need to multiply the value of two columns, for example,
Qty * Price
, to use in a measure in our cube we have two options. We can perform the calculation in the DSV, but as earlier, we are hiding the calculation in a Visual Studio project and other people will not be able to see what we're doing easily. If we perform the calculation in a view, then other developers can reuse it, and tracking dependencies is more straightforward.This is certainly true for simple calculations. On the other hand, if we're performing complex calculations in views then we are probably missing some transformation code in our ETL. Moreover, performing this calculation will waste time when we execute the view. Performing the calculation during ETL will mean we compute the value only once; from then it will always be available.
Views are made up of plain text: We can easily search for all the occurrences of a specific column, table, or any kind of value using a simple text editor. We do not need any specialized development tools or to dive into unreadable XML code to have a clear view of how a specific field is used.
If we need to update a view, we can do it without opening SQL Server Data Tools. This means that nearly everybody can do it, although, as it is very easy to update a view, some sort of security does need to be applied.
Furthermore, as views are simple text, a source control system can handle them very easily. We can check who updated what, when they did it, and what they changed very easily.
Views can be updated very quickly: A view can be updated very quickly as it does not require any kind of processing; we just
ALTER
it and the work is done. We do not need to use anUPDATE
statement if we want to make simple (and possibly temporary) changes to the data.Views can reduce the number of columns we expose: There is really no need to expose more columns to a cube than it needs. Showing more columns will only lead to confusion and a chance that the wrong column will be used for something.
Views can provide default values when needed: When we have a NULLable column that contains
NULL
values, we can easily assign a default value to it using views. We shouldn't really have a NULLable column in a data mart, but sometimes it happens.Views can expose a star schema even if the relational model is more complex: As we've already mentioned, sometimes we end up with a relational design that is not a perfectly designed star schema. By removing unused columns, by creating joins when necessary, and in general by designing the appropriate queries, we can expose Analysis Services to a star schema, even when the relational model has a more complex structure.
Views are database objects: As views are database objects, they inherit two important properties:
We can configure security for views, and so stop unauthorized access to data very easily.
Views can belong to a schema. If we are using schemas for the definition of subject areas, we can assign views to subject areas. This will lead to a very clean project, where each object belongs to the subject area that is relevant to it.
Views can be optimized: With views we can use hints to improve performance. For example, we can use the
NOLOCK
hint to avoid locking while reading from tables; although, of course, removing locking leads to the possibility of dirty reads, and it is up to the developer to decide whether doing this is a good idea or not. Moreover, we can analyze the execution path of a view in order to fine tune it. All this can be done without affecting in any way the Analysis Services project.Note
One very important point needs to be stressed: views should not be used as a substitute for proper ETL. Whenever views are used to feed Analysis Services, they should not contain complex calculations or
WHERE
clauses as this can lead to serious processing performance and maintenance problems. We can use a view instead of ETL code for prototyping purposes, but this is very bad practice in a production system.