The source database
Analysis Services cubes are built on top of a database, but the real question is: what kind of database should this be?
We try to answer this question by analyzing the different kinds of databases we encounter in our search for the best source for our cube. In the process of doing so, we describe the basics of dimensional modeling, as well as some of the competing theories on how data warehouses should be designed.
The OLTP database
Typically, you create a BI solution when business users want to analyze, explore, and report on their data in an easy and convenient way. The data itself may be composed of thousands, millions, or even billions of rows, normally kept in a relational database built to perform a specific business purpose. We refer to this database as the On Line Transactional Processing (OLTP) database.
The OLTP database can be a legacy mainframe system, a CRM system, an ERP system, a general ledger system, or any kind of database that a company uses in order to manage their business.
Sometimes, the source OLTP may consist of simple flat files generated by processes running on a host. In such a case, the OLTP is not a real database, but we can still turn it into one by importing the flat files into a SQL Server database for example. Therefore, regardless of the specific media used to store the OLTP, we will refer to it as a database.
Some of the most important and common characteristics of an OLTP system are:
The OLTP system is normally a complex piece of software that handles information and transactions. From our point of view, though, we can think of it simply as a database.
We do not normally communicate in any way with the application that manages and populates the data in the OLTP. Our job is that of exporting data from the OLTP, cleaning it, integrating it with data from other sources, and loading it into the data warehouse.
We cannot make any assumptions, such as a guarantee in the schema structure over time, about the OLTP database's structure.
Somebody else has built the OLTP system and is probably currently maintaining it, so its structure may change over time. We do not usually have the option of changing anything in its structure anyway, so we have to take the OLTP system "as is" even if we believe that it could be made better.
The OLTP may well contain data that does not conform to the general rules of relational data modeling, such as foreign keys and constraints.
Normally in the OLTP system, you find historical data that is not correct. This is almost always the case. A system that runs for years very often has data that is incorrect and never will be corrected.
When building a BI solution we have to clean and fix this data, but normally it would be too expensive and disruptive to do this for old data in the OLTP system itself.
In our experience, the OLTP system is very often poorly documented. Our first task is, therefore, that of creating good documentation for the system, validating data, and checking it for any inconsistencies.
The OLTP database is not built to be easily queried for analytical workloads, and is certainly not going to be designed with Analysis Services cubes in mind. Nevertheless, a very common question is: "do we really need to build a dimensionally modeled data mart as the source for an Analysis Services cube?", and the answer is a definite "yes"!
As we'll see, the structure of a data mart is very different from the structure of an OLTP database and Analysis Services is built to work on data marts, not on generic OLTP databases. The changes that need to be made when moving data from the OLTP database to the final data mart structure should be carried out by specialized ETL software, such as SQL Server Integration Services, and cannot simply be handled by Analysis Services in the Data Source View.
Moreover, the OLTP database needs to be efficient for OLTP queries. OLTP queries tend to be very fast on small chunks of data, in order to manage everyday work. If we run complex queries ranging over the whole OLTP database, as BI-style queries often do, we will create severe performance problems for the OLTP database. There are very rare situations in which data can flow directly from the OLTP through to Analysis Services, but these are so specific that their description is outside the scope of this book.
Beware of the temptation to avoid building a data warehouse and data marts. Building an Analysis Services cube is a complex job that starts with getting the design of your data mart right. If we have a dimensional data mart, we have a database that holds dimension and fact tables that can contain transformed, cleansed, and calculated information. If, on the other hand, we rely on the OLTP database, we might finish our first cube in less time, but our data will be dirty, inconsistent, and unreliable, and cube processing will be slow. In addition, we will not be able to create complex relational models to accommodate our users' analytical needs.
The data warehouse
We always have an OLTP system as the original source of our data, but when it comes to the data warehouse, it is can be difficult to answer the apparently simple question of whether we have a data warehouse. The problem is not the answer, as every analyst will happily reply "Yes, we do have a data warehouse"; the problem is in the meaning of the term "data warehouse".
There are at least two major approaches to data warehouse design and development and consequently, to the definition of what a data warehouse is. They are described in the books of the following two leading authors:
Ralph Kimball: If we are building a Kimball data warehouse, we build fact tables and dimension tables structured as data marts. We will end up with a data warehouse composed of the sum of all the data marts.
Bill Inmon: If our choice is that of an Inmon data warehouse, then we design a (somewhat normalized) physical relational database that will hold the data warehouse. Afterwards, we produce departmental data marts with their star schemas populated from that relational database.
If this were a book about data warehouse methodology, then we could write hundreds of pages about this topic, but luckily for the reader, the detailed differences between the Inmon and Kimball methodologies are out of the scope of this book. Readers can find out more about these methodologies in Building the Data Warehouse by Bill Inmon and The Data Warehouse Toolkit by Ralph Kimball. Both books should be present on any BI developer's bookshelf.
A picture is worth a thousand words when trying to describe the differences between the two approaches. In Kimball's bus architecture, data flows from the OLTP through to the data marts, as shown in the following diagram:
In contrast, in Inmon's view, data coming from the OLTP systems needs to be stored in the enterprise data warehouse and from there, goes to the data marts, as shown in the following diagram:
What is important is to understand is that the simple phrase "data warehouse" has different meanings in each of these methodologies. Finally, it is worth mentioning that in the recent years several authors proposed combined methodologies (see, for an example, http://www.sqlbi.com/articles/sqlbi-methodology).
We will adopt Inmon's meaning for the term data warehouse. This is because in Inmon's methodology, the data warehouse is a real database, while in Kimball's view, the data warehouse is composed of integrated data marts. For the purposes of this chapter, though, what is really important is the difference between the data warehouse and the data mart. At the end, the source of an SSAS cube needs to be a data mart, maybe composed of many fact tables and dimensions, but definitely a data mart.
The data mart
Whether you are using the Kimball or Inmon methodology, the frontend database just before the Analysis Services cube should be a data mart. A data mart is a database that is modeled according to the rules of Kimball's dimensional modeling methodology, and is composed of fact tables and dimension tables.
As a result, we'll spend a lot of time discussing data mart structure in the rest of this chapter. However, you will not learn how to build and populate a data mart from reading this chapter; the books by Kimball and Inmon we've already cited do a much better job than we ever could.
Nevertheless, if you are reading this book, it means you are using Analysis Services and so you will need to design your data marts with specific features of Analysis Services in mind. This does not mean you should completely ignore the basic theory of data warehouse design and dimensional modeling, but instead, adapt the theory to the practical needs of the product you are going to use as the main interface for querying the data.
For this reason, we are going to present a summary of the theory and discuss how the theoretical design of the data warehouse is impacted by the adoption of Analysis Services.