What data modeling means in Power BI
Data modeling is undoubtedly one of the most important parts of Power BI development. It is crucial to understand how the purpose of data modeling in Power BI differs from that of data models in transactional systems. In a transactional system, the goal is to have a model optimized for recording transactional data. In contrast, a well-designed data model in Power BI must be optimized for querying the data and for reducing the dataset size by aggregating the data.
In reality, not everyone has the luxury of having a data warehouse, so knowing how to create a data model in Power BI is a vital skill. While it is very tempting to get all the data from various data sources and import it into Power BI, answering business questions can quickly translate into complex queries that take a long time to process, which is not ideal. The best practice is to avoid importing everything from the data sources and then dealing with the resulting problems later, such as performance issues, data model complexity, and unnecessarily large data models. Instead, it is wise to get the data model right so it precisely answers business-driven questions in the most performant way. When modeling data in Power BI, we must build the data model based on the business logic. So, we may need to join different tables and aggregate the data to a level that answers all business-driven questions, which can be tricky when the data comes from various data sources of different grains.
Therefore, we need to transform and reshape the data in Power Query before loading it into the data model. After cutting all the noise from the data, we have a clean, easy-to-understand, and easy-to-work-with data model.
Semantic model
Power BI inherits its characteristics from Power Pivot and SSAS Tabular. Both use the xVelocity engine, an updated version of the VertiPaq engine designed for in-memory data analysis. The xVelocity engine leverages column store indexing and keeps semantic model objects, such as tables, relationships, hierarchies, and measures, in memory. All of this means we should expect tremendous performance gains over highly compressed data, right? Well, it depends. We can expect fast and responsive reports if we efficiently transform and model the data to support the business logic. Conceptually, the data model in Power BI is a semantic model. Let us untangle this a bit.
A semantic model is a unified data model that provides business context to data. The semantic model can be accessed from various data visualization tools, such as Excel, without further transformation. When we publish a Power BI report file (PBIX) to the Power BI Service, the service stores the report in two separate objects: the transformation steps and the data model are stored as a Dataset object, and the report is stored as a Report object. A dataset in the Power BI Service is indeed our semantic model. We can connect to datasets from Power BI Desktop, analyze them in Excel, or use third-party tools such as Tableau. The latter requires an XMLA endpoint connection to a Power BI Premium dataset.
We will not cover the details of XMLA endpoints in this book. You can read more about XMLA endpoints here: https://docs.microsoft.com/en-us/power-bi/enterprise/service-premium-connect-tools?WT.mc_id=DP-MVP-5003466.
Building an efficient data model in Power BI
An efficient data model is easy to understand and easy to maintain. At the same time, it must answer all data-driven questions the business may ask. Let us analyze the preceding sentence. An efficient model must do the following:
- Perform well (be fast)
- Be business-driven
- Decrease the complexity of the required DAX expressions (easy to understand)
- Require low maintenance (low cost)
Let us look at the preceding points with a scenario.
We are tasked to create a report on top of the following three different data sources:
- An OData data source with 15 tables. The tables have between 50 and 250 columns.
- An Excel file with 20 sheets with interdependencies and many formulas.
- A data warehouse hosted in SQL Server. The data comes from five dimensions and two fact tables:
- Within those five dimensions is a Date and a Time dimension. The grain of the Time dimension is the hour and minute.
- Each of the fact tables has between 50 and 200 million rows. From a Date and Time perspective, the grain of both fact tables is the day, hour, and minute.
- The organization owns Power BI Pro licenses.
Before getting the data from the source systems, the preceding scenario raises some essential points, many of which are unclear at this stage.
Let us analyze the scenario, along with some related questions we might ask the business to optimize the performance of the report and avoid customer dissatisfaction:
- OData: OData is an online data source, so it could be slow to load the data.
- The tables are wide, which can potentially impact the performance.
- Do we need to import all the columns from those 15 tables?
- Do we also need to import all data, or is just a portion of the data enough? For example, if the data source contains 10 years of data, does the business need to analyze all the historical data, or does bringing 1 or 2 years of data fit the purpose?
- The organization owns Power BI Pro licenses, so a 1 GB file size limit applies.
- Excel: Excel files with many formulas can be tricky data sources.
- Does the business require you to analyze all the data contained in the 20 sheets? We may be able to exclude some of those sheets.
- How often are the formulas edited? This is critical, as modifying the formulas can easily break the data processing in Power Query and generate errors. It is best to load the raw data from Excel (before the formulas are applied) and replicate the formulas' logic in Power BI.
- Data warehouse in SQL Server: It is beneficial to have a data warehouse as a source system, as data warehouses typically have a much better structure from an analytical viewpoint. In our scenario, the finest grain of both fact tables is down to a minute, which can quickly become an issue. Remember, we have Power BI Pro licenses, so we are limited to a 1 GB file size only.
- Does the business need to analyze all the metrics down to the minute, or is the day level enough?
- Does the business need to analyze the whole history, or is bringing a portion of the data enough? (A Power Query sketch following this list shows one way to bring only the columns and history we need.)
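To make the last two questions more concrete, here is a minimal Power Query (M) sketch of the kind of data reduction they point to: keeping only the columns the business needs and only the last two years of history. This is only an illustration; the query and column names (Sales, OrderDate, CustomerKey, SalesAmount) are assumptions, not part of the scenario's real sources.

    let
        // Sales is assumed to be an existing query pointing at one of the wide source tables
        Source = Sales,
        // Keep only the columns the business actually needs
        SelectedColumns = Table.SelectColumns(Source, {"OrderDate", "CustomerKey", "SalesAmount"}),
        // Keep only the last two years of history
        LastTwoYears = Table.SelectRows(
            SelectedColumns,
            each Date.From([OrderDate]) >= Date.AddYears(Date.From(DateTime.LocalNow()), -2)
        )
    in
        LastTwoYears

The same pattern applies to each of the wide OData tables and to the fact tables in the data warehouse.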
We now have a handful of questions to ask. One common question on the list is about the necessity of analyzing all the historical data. What if the business needs to analyze the whole history? In that case, we should consider using some advanced modeling techniques such as composite models and aggregations.
On the bright side, the fact that we already have five dimensions in the data warehouse might come in handy. We might reuse those dimensions with minimal changes in our data model. So, it is wise to look at the other data sources and find commonalities in the data patterns.
We may come up with some more legitimate points and questions later. The takeaway is that we have to talk to the business and ask questions before starting the job. It is a big mistake to start getting data from the source systems before framing the questions around the business processes, requirements, and technology limitations. There are also other points we need to think about from a project management perspective, which are beyond the scope of this book.
The following are initial points to take into account for building an efficient data model:
- We must ask questions to avoid confusion and potential future reworks.
- We need to understand the technical limitations and come up with solutions.
- We need a good understanding of data modeling so we can look for common data patterns and prevent overlaps.
At this point, you may think, “OK, but how can we get there?” This book aims to cover all the preceding points and more. The rest is about you and how you apply what you learn to your daily Power BI challenges. The next section explains the star schema and snowflaking.
Star schema (dimensional modeling) and snowflaking
First things first, the star schema and dimensional modeling are the same thing. In Power BI data modeling, the term star schema is more commonly used. The following sections are generic reminders about some star schema concepts.
Transactional modeling versus star schema modeling
In transactional systems, the main goal is improving the solution’s performance in creating new records and updating/deleting existing ones. So, when designing transactional systems, it is essential to go through the normalization process to decrease data redundancy and increase data entry performance by breaking the tables down into master-detail tables.
But the goal of a business analysis system is very different. In a business analysis solution, we need a data model optimized for querying in the most performant way.
Let us continue with a scenario. Suppose we have a transactional retail system for an international retail shop. We have hundreds of transactions every second from different parts of the world. The company owners want to see the total sales in the past 6 months.
This calculation sounds easy. It is just a simple SUM of sales. But wait, we have hundreds of transactions every second, right? If we have 100 transactions per second, then we have 8,640,000 transactions a day. So, for 6 months of data, we have more than 1.5 billion rows. Therefore, even a simple SUM of sales takes a considerable amount of time to process.
Now, the business raises a new request. The company owners now want to see the total sales in the past 6 months by country and city. They simply want to know what the best-selling cities are.
We need to add another condition to our simple SUM calculation, which translates into a join to the geography table. For those coming from a relational database design background, it is well known that joins are relatively expensive operations. This scenario can go on and on. So, you can imagine how quickly a simple scenario turns into a rather tricky situation.
In a star schema, however, all those tables are already joined based on business entities, and the data is aggregated and loaded into denormalized tables. In the preceding scenario, the business is not interested in seeing every transaction at the second level. So, summarizing the data at the day level decreases the row count from more than 1.5 billion to approximately 18,000 rows for the 6 months. Now you can imagine how much faster the summation runs over 18,000 rows instead of 1.5 billion.
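As a rough illustration, the following Power Query (M) sketch shows how such transactional rows could be summarized to the day, country, and city level before load. The query and column names (SalesTransactions, TransactionDateTime, Country, City, SalesAmount) are hypothetical.

    let
        // SalesTransactions is an assumed query returning one row per transaction
        Source = SalesTransactions,
        // Truncate the per-second timestamp to a date
        ToDate = Table.TransformColumns(Source, {{"TransactionDateTime", Date.From, type date}}),
        // One row per day, country, and city instead of one row per transaction
        DayLevel = Table.Group(
            ToDate,
            {"TransactionDateTime", "Country", "City"},
            {{"Sales Amount", each List.Sum([SalesAmount]), type number}}
        )
    in
        DayLevel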
The idea of the star schema is to keep all numeric values in separate tables called fact tables and put all descriptive information into other tables called dimension tables. Usually, the fact tables are surrounded by dimensions explaining the facts. A data model with a fact table in the middle surrounded by dimensions looks like a star, which is why this modeling approach is called a star schema.
This book generally uses Adventure Works DW data, a well-known Microsoft sample dataset, unless stated otherwise. Adventure Works is an imaginary international bike shop selling products online and in its retail shops.
The following figure shows Internet Sales in a star schema shape:
Figure 1.9: Internet Sales star schema
Snowflaking
Snowflaking is when we do not have a perfect star schema in which the dimension tables connect directly to the fact table. Instead, some levels of description are stored in separate tables, so some dimensions in the model are linked to other tables that describe the dimensions in more detail. In other words, snowflaking is normalizing the dimension tables. In some cases, snowflaking is inevitable; nevertheless, the general rule of thumb in data modeling in Power BI (when following a star schema) is to avoid snowflaking as much as possible to keep the model simpler and more performant. The following figure shows snowflaking in Adventure Works Internet Sales:
Figure 1.10: Adventure Works, Internet Sales snowflakes
We could avoid the Product Category and Product Subcategory snowflakes in the preceding model by denormalizing them into the Product table. The following section explains denormalization.
Understanding denormalization
In real-world scenarios, not everyone has the luxury of a pre-built data warehouse designed as a star schema. In reality, snowflakes in data warehouse designs are inevitable. Data models are usually connected to various data sources, including transactional database systems and non-transactional sources such as Excel and CSV files. So, we almost always need to denormalize the model to a certain degree. Depending on the business requirements, we may end up with some normalization along with some denormalization. The reality is that there is no specific rule for the level of normalization and denormalization. The general rule of thumb is to denormalize the model so that each dimension describes all the related details as much as possible.
In the preceding example from Adventure Works DW, we have snowflakes of Product Category and Product Subcategory that can be denormalized into the Product dimension.
Let us go through a hands-on exercise.
Go through the following steps to denormalize the Product Category and Product Subcategory into the Product dimension.
You need to download the Adventure Works, Internet Sales.pbix file from here:
You also need to download the Excel file used as the data source in the PBIX file from here:
Before we start, let us set up the sample file first. Open the Adventure Works, Internet Sales.pbix file, then change the Adventure Works DW Excel Path parameter as follows:
- Click the Transform data drop-down button.
- Click Edit parameters.
- Enter the path of the location where you saved the AdventureWorksDW2017.xlsx file.
- Click OK.
The following image illustrates the preceding steps:
Figure 1.11: Setting up the Adventure Works, Internet Sales.pbix sample file
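For reference, a Power Query parameter such as Adventure Works DW Excel Path is simply a query whose value carries a meta record. Behind the scenes, it looks roughly like the following sketch; the path shown is just a placeholder for wherever you saved the file:

    // A rough sketch of how the parameter is defined in M; the path is a placeholder
    "C:\SampleFiles\AdventureWorksDW2017.xlsx"
        meta
        [
            IsParameterQuery = true,
            Type = "Text",
            IsParameterQueryRequired = true
        ]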
After changing the value of the parameter, we have to click Apply changes.
Figure 1.12: Applying changes after changing the parameter value
The preceding process reloads the data from the Excel file.
Now that we have correctly set up the sample file, follow these steps:
- Click Transform data in the Queries section of the Home tab.
- Click the Product query.
- Click Merge Queries in the Combine section of the Home tab.
- Select Product Subcategory from the drop-down list.
- Click ProductSubcategoryKey in the Product table.
- Click ProductSubcategoryKey in the Product Subcategory table.
- Select Left Outer (all from first, matching from second) from the Join Kind drop-down.
- Click OK.
Figure 1.13: Merging Product and Product Subcategory
This adds a new step named Merged Queries, which creates a new structured column named Product Subcategory. As you can see, the values of this column are all Table; this type of column is called a Structured Column:
You will learn more about structured columns in Chapter 3, Data Preparation in Power Query Editor.
Figure 1.14: Merging the Product and Product Subcategory tables
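Behind the scenes, the Merge Queries dialog writes the corresponding M step inside the Product query for us. It should look roughly like the following; the name of the previous step (#"Changed Type" here) is an assumption and may differ in your file:

    // The generated merge step in the Product query (step names may vary)
    #"Merged Queries" = Table.NestedJoin(
        #"Changed Type",                  // the previous step of the Product query (assumed name)
        {"ProductSubcategoryKey"},
        #"Product Subcategory",
        {"ProductSubcategoryKey"},
        "Product Subcategory",
        JoinKind.LeftOuter
    )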
Now let us look at how to expand a structured column in the Query Editor:
- Click the Expand button to expand the Product Subcategory column.
- Select ProductCategoryKey.
- Select the EnglishProductSubcategoryName column and unselect the rest.
- Unselect Use original column name as prefix.
- Click OK.
Figure 1.15: Expanding Structured Column in the Query Editor
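Again, the UI writes the corresponding M for us; the generated step inside the Product query should resemble the following (step names may vary):

    // The generated expand step (step names may vary)
    #"Expanded Product Subcategory" = Table.ExpandTableColumn(
        #"Merged Queries",
        "Product Subcategory",
        {"EnglishProductSubcategoryName", "ProductCategoryKey"},
        {"EnglishProductSubcategoryName", "ProductCategoryKey"}
    )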
So far, we have added the EnglishProductSubcategoryName and ProductCategoryKey columns from the Product Subcategory query to the Product query. The next step is to add EnglishProductCategoryName from the Product Category query. To do so, we need to merge the Product query with Product Category:
- Click Merge Queries again.
- Select Product Category from the drop-down list.
- Select ProductCategoryKey from the Product table.
- Select ProductCategoryKey from the Product Category table.
- Select Left Outer (all from first, matching from second).
- Click OK:
Figure 1.16: Merging Product and Product Category
This adds a new structured column named Product Category. We now need to do the following:
- Expand the new column.
- Pick EnglishProductCategoryName from the list.
- Unselect Use original column name as prefix.
- Click OK:
Figure 1.17: Merging Product and Product Category
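The second merge and expand generate two more M steps that mirror the earlier ones. They should look roughly like this; the step names are assumptions and may differ in your file:

    // The generated merge and expand steps for Product Category (step names may vary)
    #"Merged Queries1" = Table.NestedJoin(
        #"Expanded Product Subcategory",
        {"ProductCategoryKey"},
        #"Product Category",
        {"ProductCategoryKey"},
        "Product Category",
        JoinKind.LeftOuter
    ),
    #"Expanded Product Category" = Table.ExpandTableColumn(
        #"Merged Queries1",
        "Product Category",
        {"EnglishProductCategoryName"},
        {"EnglishProductCategoryName"}
    )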
So far, we have moved EnglishProductCategoryName into the Product table, which means the ProductCategoryKey column is no longer needed. The next step is to remove it. To do so, follow these steps:
- Click on the ProductCategoryKey column.
- Click the Remove Columns button in the Manage Columns section of the Home tab:
Figure 1.18: Removing a column in the Query Editor
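The corresponding generated M step should look roughly like this (step names may vary):

    // The generated step removing the no-longer-needed key column (step names may vary)
    #"Removed Columns" = Table.RemoveColumns(
        #"Expanded Product Category",
        {"ProductCategoryKey"}
    )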
We have now merged the Product Category and Product Subcategory snowflakes into the Product query, successfully denormalizing them.
The very last step is to unload both the Product Category and Product Subcategory queries:
- Right-click on each query.
- Untick Enable load from the menu.
- Click Continue on the Possible Data Loss Warning pop-up message:
Figure 1.19: Unloading queries in the Query Editor
Now we need to import the data into the data model by clicking Close & Apply:
Figure 1.20: Importing data into the data model
We have now achieved what we were after: we denormalized the Product Category and Product Subcategory tables, so instead of loading those two tables, we now have EnglishProductCategoryName and EnglishProductSubcategoryName as new columns in the Product table. So the data model is simpler now, and we load less data, leading to better performance.
Job done!