Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Mastering Microsoft Power BI – Second Edition

You're reading from   Mastering Microsoft Power BI – Second Edition Expert techniques to create interactive insights for effective data analytics and business intelligence

Arrow left icon
Product type Paperback
Published in Jun 2022
Publisher Packt
ISBN-13 9781801811484
Length 712 pages
Edition 2nd Edition
Languages
Arrow right icon
Authors (2):
Arrow left icon
Greg Deckler Greg Deckler
Author Profile Icon Greg Deckler
Greg Deckler
Brett Powell Brett Powell
Author Profile Icon Brett Powell
Brett Powell
Arrow right icon
View More author details
Toc

Table of Contents (18) Chapters Close

Preface 1. Planning Power BI Projects 2. Preparing Data Sources FREE CHAPTER 3. Connecting to Sources and Transforming Data with M 4. Designing Import, DirectQuery, and Composite Data Models 5. Developing DAX Measures and Security Roles 6. Planning Power BI Reports 7. Creating and Formatting Visualizations 8. Applying Advanced Analytics 9. Designing Dashboards 10. Managing Workspaces and Content 11. Managing the On-Premises Data Gateway 12. Deploying Paginated Reports 13. Creating Power BI Apps and Content Distribution 14. Administering Power BI for an Organization 15. Building Enterprise BI with Power BI Premium 16. Other Books You May Enjoy
17. Index

Dataset design

Designing Power BI datasets is in many respects similar to designing data warehouses. Both datasets and data warehouses share concepts such as fact and dimension tables, star schemas, slowly changing dimensions, fact table granularity, and local and foreign keys for building relationships between tables.

This similarity allows us to use the same proven tools and processes for designing and building Power BI datasets as are used to design data warehouses. In this section, we cover the tools and processes used to design Power BI datasets, starting with the data warehouse bus matrix.

Data warehouse bus matrix

The data warehouse bus matrix is a staple of the Ralph Kimball data warehouse architecture, which provides an incremental and integrated approach to data warehouse design. This architecture, as per The Data Warehouse Toolkit (Third Edition) by Ralph Kimball, allows for scalable data models, as multiple business teams or functions often require access to the same business process data and dimensions.

To promote reusability and project communication, a data warehouse bus matrix of business processes and shared dimensions is recommended. An example data warehouse bus matrix is shown in Figure 1.11:

Figure 1.11: Data warehouse bus matrix

Each row in Figure 1.11 reflects an important and recurring business process, such as the monthly close of the general ledger, and each column represents a business entity, which may relate to one or several of the business processes. The shaded rows (Internet Sales, Reseller Sales, and Sales Plan) identify the business processes that will be implemented as their own star schemas for this project.

The bus matrix can be developed in collaboration with business stakeholders, such as the corporate finance manager, as well as source system and business intelligence or data warehouse SMEs.

The architecture of the dataset should support future BI and analytics projects of the organization involving the given business processes (fact tables) and business entities (dimension tables). For example, the same dataset containing Internet Sales data should support both an executive’s sales and revenue dashboard as well a business analyst’s ad hoc analysis via Excel PivotTables.

Additional business processes, such as maintaining product inventory levels, could potentially be added to the same Power BI dataset in a future project. Importantly, these future additions could leverage existing dimension tables, such as a product table, including its source query, column metadata, and any defined hierarchies.

Each Power BI report is usually tied to a single dataset. Given this 1:1 relationship and the analytical value of integrated reports across multiple business processes, such as Inventory and Internet Sales, it’s important to design datasets that can scale to support multiple star schemas. Consolidating business processes into one or a few datasets also makes solutions more manageable and is a better use of source system resources, as common tables (for example, Product, Customer) are only refreshed once.

The data warehouse bus matrix is a proven tool used during the design process of data warehouses and is just as effective for designing Power BI datasets. We cover this design process in the next section.

Dataset design process

With the data warehouse bus matrix as a guide, the business intelligence team can work with representatives from the relevant business teams and project sponsors to complete the following four-step dataset design process:

  1. Select the business process
  2. Declare the grain
  3. Identify the dimensions
  4. Define the facts

In the following sections, we cover each of these steps in detail, starting with selecting the business process.

Select the business process

Ultimately, each business process is represented by a fact table with a star schema of many-to-one relationships to dimensions. In a discovery or requirements gathering process, it can be difficult to focus on a single business process in isolation as users regularly analyze multiple business processes simultaneously or need to.

Nonetheless, it’s essential that the dataset being designed reflects low-level business activities (for example, receiving an online sales order) rather than consolidation or integration of distinct business processes such as a table with both online and reseller sales data:

  • Confirm that the answer provided to the first question of the project template from Table 1.2 regarding data sources is accurate.
  • In this project, the required business processes are Internet Sales, Reseller Sales, and Annual Sales and Margin Plan.
  • Each of the three business processes corresponds to a fact table to be included in the Power BI dataset.
  • Obtain a high-level understanding of the top business questions for each business process. For example, “What are total sales relative to the Annual Sales Plan and relative to last year?”.
  • In this project, Internet Sales and Reseller Sales are combined into overall corporate sales and margin KPIs.
  • Optionally, reference the data warehouse bus matrix of business processes and their related dimensions. For example, discuss the integration of inventory data and the insights this integration may provide.
  • In many projects, a choice or compromise has to be made given the limited availability of certain business processes and the costs or timelines associated with preparing this data for production use.
  • Additionally, business processes (fact tables) are the top drivers of the storage and processing costs of the dataset and thus should only be included if necessary.

A common anti-pattern (a response to a reoccurring problem that is generally ineffective and potentially counterproductive) to avoid in Power BI projects is the development of datasets for specific projects or teams rather than business processes. For example, developing a dataset exclusively for the marketing team and another dataset created for the sales organization. Assuming both teams require access to the same sales data, this approach naturally leads to a waste of resources, as the same sales data is queried and refreshed twice and both datasets consume storage resources in the Power BI service.

Additionally, this isolated approach leads to manageability and version control issues, as the datasets may contain variations in transformation or metric logic. Therefore, although the analytical needs of specific business users or teams are indeed the priority of BI projects, it’s important to plan for sustainable solutions that can ultimately be shared across teams.

Let’s now look at the next step in the process, declaring the grain.

Declare the grain

The grain of fact tables ultimately governs the level of detail available for analytical queries as well as the amount of data to be accessed. Higher grains mean more detail while lower grains mean less detail.

All rows of a fact table should represent the individual business process from step 1 at a certain level of detail or grain such as the header level or line level of a purchase order. Therefore, each row should have the same meaning and thus contain values for the same key columns to dimensions and the same numeric columns.

During this step, determine what each row of the different business processes represents. For example, each row of the Internet Sales fact table represents the line of a sales order from a customer. Conversely, the rows of the Sales and Margin Plan are aggregated to the level of a Calendar Month, Products Subcategory, and Sales Territory region.

If it’s necessary to apply filters or logic to treat certain rows of a fact table differently than others, the fact table likely contains multiple business processes (for example, shipments and orders). Although it’s technically possible to build this logic into DAX measure expressions, well-designed fact tables benefit Power BI and other data projects and tools over the long term. Thus, in such circumstances, it is advisable to split the table into two separate tables.

When analyzing the grain of fact tables, consider the following:

  • Review and discuss the implications of the chosen grain in terms of dimensionality and scale
  • Higher granularities provide greater levels of dimensionality and thus detail but result in much larger fact tables
  • If a high grain or the maximum grain is chosen, determine the row counts per year and the storage size of this table once loaded into Power BI datasets
  • If a lower grain is chosen, ensure that project stakeholders understand the loss of dimensionalities, such as the inability to filter for specific products or customers

In general, a higher granularity is recommended for analytical power and sustainability. If a less granular design is chosen, such as the header level of a sales order, and this grain later proves to be insufficient to answer new business questions, then either a new fact table would have to be added to the dataset or the existing fact table and all of its measures and dependent reports would have to be replaced.

Once the grains of all fact tables are determined, it is time to move on to the next step and identify the dimensions.

Identify the dimensions

Dimensions are a natural byproduct of the grain chosen in the previous design process step. A single sample row from the fact table should clearly indicate the business entities (dimensions) associated with the given process such as the customer who purchased an individual product on a certain date and at a certain time via a specific promotion.

Fact tables representing a lower grain have fewer dimensions. For example, a fact table representing the header level of a purchase order may identify the vendor but not the individual products purchased from the vendor.

When analyzing dimensions, consider the following:

  • Identify and communicate the dimensions that can be used to filter (aka slice and dice) each business process.
  • The foreign key columns based on the grain chosen in the previous step reference dimension tables.
  • Review a sample of all critical dimension tables, such as Product or Customer, and ensure these tables contain the columns and values necessary or expected.
  • Communicate which dimensions can be used to filter multiple business processes simultaneously. For example, in this project, the Product, Sales Territory, and Date dimensions can be used to filter all three fact tables.
  • The data warehouse bus matrix referenced earlier can be helpful for this step.
  • Look for any gap between the existing dimension tables and business questions or related reports.
  • For example, existing IT-supported reports may contain embedded logic that creates columns via Structured Query Language (SQL) that are not stored in the data warehouse.
  • Strive to maintain version control for dimension tables and the columns (attributes) within dimension tables.
  • It may be necessary for project stakeholders to adapt or migrate from legacy reports or an internally maintained source to the Corporate BI source.

A significant challenge to the identity of the dimensions step can be a lack of Master Data Management (MDM) and alternative versions. MDM is a discipline practiced by organizations in order to ensure the accuracy, uniformity, semantic consistency, and stewardship of the official data assets.

For example, the sales organization may maintain its own dimension tables in Excel or Microsoft Access and its naming conventions and hierarchy structures may represent a conflict or gap with the existing data warehouse. Additionally, many corporate applications may store their own versions of common dimensions, such as products and customers.

These issues should be understood and, despite pressure to deliver BI value quickly or according to a specific business team’s preferred version, the long-term value of a single definition for an entire organization as expressed via the bus matrix should not be sacrificed.

With dimensions identified, the final step is to define the fact tables.

Define the facts

The facts represent the numeric columns included in the fact tables. While the dimension columns from step 3 are used for relationships to dimension tables, the fact columns are used in measures containing aggregation logic such as the sum of a quantity column and the average of a price column.

When defining the facts, consider the following:

  • Define the business logic for each fact, represented by measures in the dataset. For example, gross sales is equal to the extended amount on a sales order, and net sales is equal to gross sales minus discounts.
  • Any existing documentation or relevant technical metadata should be reviewed and validated.
  • Similar to the dimensions, any conflicts between existing definitions should be addressed so that a single definition for a core set of metrics is understood and approved.
  • Additionally, a baseline or target source should be identified to validate the accuracy of the metrics to be created. For example, several months following the project, it should be possible to compare the results of DAX measures from the Power BI dataset to an SSRS report or a SQL query.
  • If no variance exists between the two sources, the DAX measures are valid and thus any doubt or reported discrepancy is due to some other factor

See Chapter 2, Preparing Data Sources, Chapter 3, Connecting Sources and Transforming Data with M, Chapter 4, Designing Import and DirectQuery Data Models, and Chapter 5, Developing DAX Measures and Security Roles, for details on the fact table columns to include in Power BI datasets (for import or DirectQuery) and the development of DAX metric expressions. The fact definitions from this step relate closely to the concept of base measures described in Chapter 5, Developing DAX Measures and Security Roles.

Ultimately, the DAX measures implemented have to tie to the approved definitions, but there are significant data processing, storage, and performance implications based on how this logic is computed. In many cases, the Power BI dataset can provide the same logic as an existing system but via an alternative methodology that better aligns with Power BI or the specific project need.

This concludes the dataset design process. Next, we cover another important topic related to datasets, data profiling.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime