The source database
Analysis Services cubes are built on top of a database, but the real question is: what kind of database should this be?
We will try to answer this question by analyzing the different kinds of databases we will encounter in our search for the best source for our cube. In the process of doing so we are going to describe the basics of dimensional modeling, as well as some of the competing theories on how data warehouses should be designed.
The OLTP database
Typically, a BI solution is created when business users want to analyze, explore and report on their data in an easy and convenient way. The data itself may consist of thousands, millions or even billions of rows, normally kept in a relational database designed and built to serve a specific business purpose. We refer to this database as the Online Transaction Processing (OLTP) database.
The OLTP database can be a legacy mainframe system, a CRM system, an ERP system, a general ledger system, or any kind of database that a company has bought or built in order to manage its business.
Sometimes the OLTP may consist of simple flat files generated by processes running on a host. In such a case, the OLTP is not a real database, but we can still turn it into one by importing the flat files into, for example, a SQL Server database. Therefore, regardless of the specific medium used to store the OLTP data, we will refer to it as a database.
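As a minimal sketch of this approach, assuming a purely hypothetical staging table, file path and export format, the flat file could be loaded into SQL Server with something like this:

    -- Hypothetical staging table for a flat file exported from the host system
    CREATE TABLE dbo.StagingOrders (
        OrderNumber  varchar(20)   NOT NULL,
        OrderDate    varchar(10)   NOT NULL, -- kept as text here; cleaned later by the ETL
        CustomerCode varchar(20)   NOT NULL,
        Amount       decimal(18,2) NULL
    );

    -- Load the exported file; the path and delimiters are assumptions about the export format
    BULK INSERT dbo.StagingOrders
    FROM 'C:\exports\orders.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);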
Some of the most important and common characteristics of an OLTP system are:
The OLTP system is normally a complex piece of software that handles information and transactions; from our point of view, though, we can think of it simply as a database.
We do not normally communicate in any way with the application that manages and populates the data in the OLTP. Our job is that of exporting data from the OLTP, cleaning it, integrating it with data from other sources, and loading it into the data warehouse.
We cannot make any assumptions about the OLTP database's structure.
Somebody else has built the OLTP system and is probably currently maintaining it, so its structure may change over time. We do not usually have the option of changing anything in its structure anyway, so we have to take the OLTP system "as is" even if we believe that it could be made better.
The OLTP may well contain data that does not conform to the general rules of relational data modeling, like foreign keys and constraints.
In the OLTP system we will almost always find historical data that is not correct: a system that has been running for years very often contains data that is wrong and that will never be corrected.
When building our BI solution we'll have to clean and fix this data, but normally it would be too expensive and disruptive to do this for old data in the OLTP system itself.
In our experience, the OLTP system is very often poorly documented. Our first task is, therefore, to create good documentation for the system, validate the data and check it for any inconsistencies.
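As an example of the kind of simple validation query we might run while documenting the system, the following (with purely hypothetical table and column names) looks for order rows that reference customers that do not exist, the sort of inconsistency that an enforced foreign key would normally prevent:

    -- Orders pointing at a non-existent customer: a typical inconsistency
    -- in OLTP data where foreign keys were never enforced
    SELECT o.OrderID, o.CustomerID
    FROM dbo.Orders AS o
    LEFT JOIN dbo.Customers AS c
        ON c.CustomerID = o.CustomerID
    WHERE c.CustomerID IS NULL;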
The OLTP database is not built to be easily queried, and was certainly not designed with Analysis Services cubes in mind. Nevertheless, a very common question is: "do we really need to build a dimensionally modeled data mart as the source for an Analysis Services cube?" The answer is a definite "yes"!
As we'll see, the structure of a data mart is very different from the structure of an OLTP database and Analysis Services is built to work on data marts, not on generic OLTP databases. The changes that need to be made when moving data from the OLTP database to the final data mart structure should be carried out by specialized ETL software, like SQL Server Integration Services, and cannot simply be handled by Analysis Services in the Data Source View.
Moreover, the OLTP database needs to remain efficient for its own workload. OLTP queries typically read or write small chunks of data very quickly, in order to support everyday operations. If we run complex queries ranging over the whole database, as BI-style queries often do, we will create severe performance problems for the OLTP system; the sketch below contrasts the two query styles. There are very rare situations in which data can flow directly from the OLTP through to Analysis Services, but these are so specific that their description is outside the scope of this book.
Beware of the temptation to avoid building a data warehouse and data marts. Building an Analysis Services cube is a complex job that starts with getting the design of your data mart right. If we have a dimensional data mart, we have a database that holds dimension and fact tables where we can perform any kind of cleansing or calculation. If, on the other hand, we rely on the OLTP database, we might finish our first cube in less time but our data will be dirty, inconsistent and unreliable, and cube processing will be slow. In addition, we will not be able to create complex relational models to accommodate our users' analytical needs.
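To illustrate the difference in query styles, here is a simple sketch, with hypothetical table names, that contrasts a typical OLTP lookup with a BI-style aggregation:

    -- Typical OLTP query: touches a handful of rows through an index
    SELECT OrderID, OrderDate, Amount
    FROM dbo.Orders
    WHERE CustomerID = 12345;

    -- Typical BI-style query: scans and aggregates years of data,
    -- competing with everyday transactions for I/O and CPU
    SELECT YEAR(OrderDate) AS OrderYear, SUM(Amount) AS TotalSales
    FROM dbo.Orders
    GROUP BY YEAR(OrderDate)
    ORDER BY OrderYear;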
The data warehouse
We always have an OLTP system as the original source of our data but, when it comes to a data warehouse, it can be difficult to answer this apparently simple question: "Do we have a data warehouse?" The problem is not the answer, as every analyst will happily reply, "Yes, we do have a data warehouse"; the problem is in the meaning of the words "data warehouse".
There are at least two major approaches to data warehouse design and development and, consequently, to the definition of what a data warehouse is. They are described in the books of two leading authors:
Ralph Kimball: if we are building a Kimball data warehouse, we build fact tables and dimension tables structured as data marts. We will end up with a data warehouse composed of the sum of all the data marts.
Bill Inmon: if our choice is an Inmon data warehouse, then we design a (somewhat normalized) physical relational database that will hold the data warehouse. Afterwards, we produce departmental data marts, with their star schemas, populated from that relational database.
If this were a book about data warehouse methodology then we could write hundreds of pages about this topic but, luckily for the reader, the detailed differences between the Inmon and Kimball methodologies are outside the scope of this book. Readers can find out more about these methodologies in Building the Data Warehouse by Bill Inmon and The Data Warehouse Toolkit by Ralph Kimball. Both books should be present on any BI developer's bookshelf.
A picture is worth a thousand words when trying to describe the differences between the two approaches. In Kimball's bus architecture, data flows from the OLTP through to the data marts as follows:
In contrast, in Inmon's view, data coming from the OLTP systems needs to be stored in the enterprise data warehouse and, from there, goes to the data marts:
What is important to understand is that the simple phrase "data warehouse" has a different meaning in each of these methodologies.
We will adopt Inmon's meaning for the term "data warehouse". This is because in Inmon's methodology the data warehouse is a real, physical database, while in Kimball's view the data warehouse is the collection of integrated data marts. For the purposes of this chapter, though, what is really important is the difference between the data warehouse and the data mart; it is the data mart that should be the source for our cube.
The data mart
Whether you are using the Kimball or the Inmon methodology, the database that directly feeds the Analysis Services cube should be a data mart. A data mart is a database modeled according to the rules of Kimball's dimensional modeling methodology, and is composed of fact tables and dimension tables.
As a result we'll spend a lot of time discussing data mart structure in the rest of this chapter. However, you will not learn how to build and populate a data mart from reading this chapter; the books by Kimball and Inmon we've already cited do a much better job than we ever could.
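Still, to give a flavour of the structures we will be talking about, here is a minimal, deliberately simplified sketch of a star schema; all of the table and column names are illustrative only. A fact table holds the numeric measures and references the surrounding dimension tables through surrogate keys:

    -- Dimension tables: descriptive attributes, one row per member
    CREATE TABLE dbo.DimDate (
        DateKey   int         NOT NULL PRIMARY KEY, -- surrogate key, e.g. 20090131
        [Date]    date        NOT NULL,
        [Month]   varchar(20) NOT NULL,
        [Year]    int         NOT NULL
    );

    CREATE TABLE dbo.DimProduct (
        ProductKey  int          NOT NULL PRIMARY KEY, -- surrogate key
        ProductCode varchar(20)  NOT NULL,             -- business key from the OLTP
        ProductName varchar(100) NOT NULL,
        Category    varchar(50)  NOT NULL
    );

    -- Fact table: one row per sale, with measures and foreign keys to the dimensions
    CREATE TABLE dbo.FactSales (
        DateKey       int           NOT NULL REFERENCES dbo.DimDate (DateKey),
        ProductKey    int           NOT NULL REFERENCES dbo.DimProduct (ProductKey),
        SalesQuantity int           NOT NULL,
        SalesAmount   decimal(18,2) NOT NULL
    );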