Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Mastering Microsoft Power BI – Second Edition

You're reading from   Mastering Microsoft Power BI – Second Edition Expert techniques to create interactive insights for effective data analytics and business intelligence

Arrow left icon
Product type Paperback
Published in Jun 2022
Publisher Packt
ISBN-13 9781801811484
Length 712 pages
Edition 2nd Edition
Languages
Arrow right icon
Authors (2):
Arrow left icon
Greg Deckler Greg Deckler
Author Profile Icon Greg Deckler
Greg Deckler
Brett Powell Brett Powell
Author Profile Icon Brett Powell
Brett Powell
Arrow right icon
View More author details
Toc

Table of Contents (18) Chapters Close

Preface 1. Planning Power BI Projects FREE CHAPTER 2. Preparing Data Sources 3. Connecting to Sources and Transforming Data with M 4. Designing Import, DirectQuery, and Composite Data Models 5. Developing DAX Measures and Security Roles 6. Planning Power BI Reports 7. Creating and Formatting Visualizations 8. Applying Advanced Analytics 9. Designing Dashboards 10. Managing Workspaces and Content 11. Managing the On-Premises Data Gateway 12. Deploying Paginated Reports 13. Creating Power BI Apps and Content Distribution 14. Administering Power BI for an Organization 15. Building Enterprise BI with Power BI Premium 16. Other Books You May Enjoy
17. Index

Dataset planning

After the source data is profiled and evaluated against the requirements identified in the four-step dataset design process, the BI team can further analyze the implementation options for the dataset. In almost all Power BI projects, even with significant investments in enterprise data warehouse architecture and ETL tools and processes, some level of additional logic, integration, or transformation is needed to enhance the quality and value of the source data or to effectively support a business requirement.

A priority of the dataset planning stage is to determine how the identified data transformation issues are addressed to support the dataset. Additionally, based on all available information and requirements, the project team must determine whether to develop an import mode dataset, DirectQuery dataset, or composite dataset. Import, DirectQuery, and composite datasets are explained in the section Import, DirectQuery, Live, and Composite Datasets.

The initial step in the dataset planning process is planning for data transformations, which we’ll review next.

Data transformations

To help clarify the dataset planning process, a diagram such as Figure 1.17 can be created that identifies the different layers of the data warehouse and Power BI dataset where transformation and business logic can be implemented:

Figure 1.17: Dataset planning architecture

In some projects, minimal transformation logic is needed and can be easily included in the Power BI dataset or the SQL views accessed by the dataset. For example, if only a few additional columns are needed for a dimension table and there’s straightforward guidance on how these columns should be computed, the IT organization may choose to implement these transformations within Power BI’s Power Query (M) queries rather than revise the data warehouse, at least in the short term.

If a substantial gap between BI needs and the corporate data warehouse is allowed to persist and grow due to various factors, such as cost, project expediency, and available data warehouse skills, then Power BI datasets become more complex to build and maintain. Dataset designers should regularly analyze and communicate the implications of datasets assuming greater levels of complexity.

However, if the required transformation logic is complex or extensive with multiple join operations, row filters, and data type changes, then the IT organization may choose to implement essential changes in the data warehouse to support the new dataset and future BI projects. For example, a staging table and a SQL stored procedure may be needed to support a revised nightly update process, or the creation of an index may be needed to deliver improved query performance for a DirectQuery dataset.

Ideally, all required data transformation and shaping logic could be implemented in the source data warehouse and its ETL processes so that Power BI is exclusively used for analytics and visualization. However, in the reality of scarce IT resources and project delivery timelines, typically at least a portion of these issues must be handled through other means, such as SQL view objects or Power BI’s M query functions.

A best practice is to implement data transformation operations within the data warehouse or source system. This minimizes the resources required to process an import mode dataset and, for DirectQuery datasets, can significantly improve query performance, as these operations would otherwise be executed during report queries.

For many common data sources, such as Oracle and Teradata, M query expressions are translated into equivalent SQL statements (if possible) and these statements are passed back to the source system via a process called query folding. See Chapter 2, Preparing Data Sources, for more details on query folding.

As per the dataset planning architecture diagram, a layer of SQL views should serve as the source objects to datasets created with Power BI Desktop. By creating a SQL view for each dimension and fact table of the dataset, the data source owner or administrator is able to identify the views as dependencies of the source tables and is therefore less likely to implement changes that would impact the dataset without first consulting the BI team.

Additionally, the SQL views improve the availability of the dataset, as modifications to the source tables are much less likely to cause the refresh process to fail.

As a general rule, the BI team and IT organization should avoid the use of DAX for data transformation and shaping logic, such as DAX calculated tables and calculated columns. The primary reason for this is that it weakens the link between the dataset and the data source, as these expressions are processed entirely by the Power BI dataset after source queries have been executed.

Additionally, the distribution of transformation logic across multiple layers of the solution (SQL, M, DAX) causes datasets to become less flexible and manageable. Moreover, tables and columns created via DAX do not benefit from the same compression algorithms applied to standard tables and columns and thus can represent both a waste of resources as well as a performance penalty for queries accessing these columns.

In the event that required data transformation logic cannot be implemented directly in the data warehouse or its ETL or Extract-Load-Transform (ELT) process, a second alternative is to build this logic into the layer of SQL views supporting the Power BI dataset. For example, a SQL view for the product dimension could be created that joins the Product, Product Subcategory, and Product Category dimension tables, and this view could be accessed by the Power BI dataset.

As a third option, M functions in the Power BI query expressions could be used to enhance or transform the data provided by the SQL views. See Chapter 2, Preparing Data Sources, for details on these functions and the Power BI data access layer generally.

Once data transformation planning is complete, the next step is to determine the mode of the dataset as explained in the next section.

Import, DirectQuery, Live, and Composite datasets

A subsequent but closely related step in dataset planning is choosing between the default import mode, DirectQuery mode, Live mode, or composite mode. In some projects, this is a simple decision as only one option is feasible or realistic given the known requirements while other projects entail significant analysis of the pros and cons of either design.

If a data source is considered slow or ill-equipped to handle a high volume of analytical queries, then an import mode dataset is very likely the preferred option. Likewise, if near real-time visibility of a data source is an essential business requirement, then DirectQuery or Live mode are the only options.

The DirectQuery and Live modes are very similar to one another. Both methods do not store data within the dataset itself but rather query source systems directly to retrieve data based upon user interaction with reports and dashboards. However, Live mode is only supported for Power BI datasets, Analysis Services (both multi-dimensional and tabular), and Dataverse.

When DirectQuery/Live is a feasible option or can be made a feasible option via minimal modifications, organizations may be attracted to the prospect of leveraging investments in high-performance database and data warehouse systems. However, the overhead costs and version control concerns of import mode can be reduced via Power BI features, such as the dataset refresh APIs or pipelines discussed in Chapter 10, Managing Application Workspaces and Content, and incremental data refresh.

The following list of questions can help guide an import versus DirectQuery/Live decision:

  1. Is there a single data source for our dataset that Power BI supports as a DirectQuery/Live source?

    For example, each fact and dimension table needed by the dataset is stored in a single data warehouse database, such as Oracle, Teradata, SQL Server, or Azure SQL Database.

    The following URL identifies the data sources supported for DirectQuery/Live with Power BI, including sources that are currently only in beta: http://bit.ly/2AcMp25.

  2. If DirectQuery/Live is an option per question 1, is this source capable of supporting the analytical query workload of Power BI?

    For example, although Azure Synapse (formerly Azure SQL Data Warehouse) technically supports DirectQuery, it’s not recommended to use Azure Synapse as a DirectQuery data source, given the limitations on the volume of concurrent queries supported and a lack of query plan caching.

    In many other scenarios, the data source may not be optimized for analytical queries, such as with star schema designs and indexes that target common BI/reporting queries. Additionally, if the database is utilized for Online Transaction Processing (OLTP) workloads and/or other BI/analytical tools, then it’s necessary to evaluate any potential impact on these applications and the availability of resources.

  3. Is an import mode dataset feasible, given the size of the dataset and any requirements for near real-time visibility of the data source?

    Currently, Power BI Premium supports import mode datasets up to 400 GB in size. However, the true limit for model sizes in Premium is limited to the total available amount of RAM within the capacity. In addition, PPU datasets are limited to 100 GB and Pro datasets are limited to 10 GB. Therefore, truly massive datasets must either use a DirectQuery data source or a Live connection to an Analysis Services model.

    Additionally, Power BI Premium currently supports a maximum of 48 refreshes per day for import mode datasets. Therefore, if there’s a need to view data source data for the last several minutes or seconds, an import mode dataset is not feasible.

  4. If the DirectQuery/Live source is capable of supporting a Power BI workload as per question 2, is the DirectQuery/Live connection more valuable than the additional performance and flexibility provided via the import mode?
  5. In other words, if an import mode dataset is feasible, as per question 3, then an organization should evaluate the trade-offs of the two modes. For example, since an import mode dataset is hosted in the Power BI service and in a compressed and columnar in-memory data store, it is likely to provide a performance advantage. This is particularly the case if the DirectQuery/Live source is hosted on-premises and thus queries from the Power BI cloud service must pass through the on-premises data gateway reviewed in Chapter 11, Managing the On-Premises Data Gateway.

    Additionally, any future data sources and most future data transformations need to be integrated into the DirectQuery/Live source. With an import mode dataset, the scheduled import process can include many data transformations and potentially other data sources without negatively impacting query performance.

For organizations that have invested in powerful data source systems for BI workloads, there’s a strong motivation to leverage this system via DirectQuery/Live. In general, business intelligence teams and architects are averse to copying data into another data store and thus creating both another data movement and a source of reporting that must be supported.

Let’s now take a more detailed look at each of the possible dataset modes.

Import mode

An import mode dataset can include multiple data sources, such as SQL Server, Oracle, and an Excel file. Since a snapshot of the source data is loaded into the Power BI cloud service, in addition to its in-memory columnar compressed structure, query performance is usually good for most scenarios.

Another important advantage of import mode datasets is the ability to implement data transformations without negatively impacting query performance. Unlike DirectQuery/Live datasets, the operations of data source SQL views and the M queries of import datasets are executed during the scheduled data refresh process. The Query design per dataset mode section of Chapter 2, Preparing Data Sources, discusses this issue in greater detail.

Given the performance advantage of the in-memory mode relative to DirectQuery/Live, the ability to integrate multiple data sources, and the relatively few use cases where real-time visibility is required, most Power BI datasets are designed using import mode.

Next, we provide more detail about DirectQuery/Live mode.

DirectQuery/Live mode

While DirectQuery and Live connections are different, as explained previously, they are similar to one another and share common traits, such as not storing data within the dataset itself but rather querying source systems directly to retrieve data based upon user interaction with reports and dashboards.

A DirectQuery/Live dataset is traditionally limited to a single data source and serves as merely a thin semantic layer or interface to simplify the report development and data exploration experience. DirectQuery/Live datasets translate report queries into compatible queries for the data source and leverage the data source for query processing, thus eliminating the need to store and refresh an additional copy of the source data.

A common use case of DirectQuery/Live is to provide near real-time visibility to changes in source data. For example, a manufacturer may want to monitor activities occurring on a manufacturing floor and potentially link this monitoring solution to notifications or alerts.

The performance of DirectQuery/Live datasets is strongly influenced by the design and resources available to the source system. Successful DirectQuery/Live datasets generally result from performance optimizations implemented in the source system such as via columnstore indexes, materialized views, and star schema designs that reduce the complexity of report queries.

With import and DirectQuery/Live modes understood, we next cover the relatively new composite mode.

Composite mode

Composite mode is perhaps the most significant enhancement to Power BI in recent years as it enables table-level control over a table’s storage mode (Import, DirectQuery, Dual).

Storage modes are covered in greater detail in Chapter 2, Preparing Data Sources. When designed effectively, a composite model can deliver the performance benefits of import (in-memory) models but also provide scalability for large DirectQuery source systems.

A common design pattern with composite models is to set the storage mode of a massive fact table to DirectQuery but configure a smaller, targeted aggregation table in import mode and related dimension tables in dual mode. Power BI automatically utilizes the in-memory aggregation table to resolve incoming report queries if the given aggregation table(s) and its related Dual mode dimension tables contain the necessary data.

Even more recently, Microsoft has unveiled DirectQuery for Power BI datasets and Azure Analysis Services. With DirectQuery for Power BI datasets and Azure Analysis Services, datasets developed and published to Power BI can be chained or extended to produce new datasets that incorporate additional import or DirectQuery sources.

With data transformation and data model storage mode decisions made, the dataset planning process is complete. Let’s now take a look at how these planning processes and decisions apply to our sample project introduced earlier in this chapter.

You have been reading a chapter from
Mastering Microsoft Power BI – Second Edition - Second Edition
Published in: Jun 2022
Publisher: Packt
ISBN-13: 9781801811484
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image