You're reading from Mastering Microsoft Power BI Expert techniques for effective data analytics and business intelligence

Product type Paperback

Published in Mar 2018

Publisher Packt

ISBN-13 9781788297233

Length 638 pages

Edition 1st Edition

Languages

Tools

Power BI

Concepts

Business Intelligence

Author (1):

Brett Powell

View More author details

Dataset planning

After the source data has been profiled and evaluated against the requirements identified in the four-step dataset design process, the BI team can further analyze the implementation options for the dataset. In almost all Power BI projects, even with significant investments in enterprise data warehouse architecture and ETL tools and processes, some level of additional logic, integration, or transformation is needed to enhance the quality and value of the source data or to effectively support a business requirement. A priority of the dataset, planning stage is to determine how the identified data transformation issues will be addressed to support the dataset. Additionally, based on all available information and requirements, the project team must determine whether to develop an import mode dataset or a DirectQuery dataset.

Data transformations

To help clarify the dataset planning process, a diagram such as the following can be created that identifies the different layers of the data warehouse and Power BI dataset where transformation and business logic can be implemented:

Dataset planning architecture

In some projects, minimal transformation logic is needed and can be easily included in the Power BI dataset or the SQL views accessed by the dataset. For example, if only a few additional columns are needed for a dimension table and there's straightforward guidance on how these columns should be computed, the IT organization may choose to implement these transformations within Power BI's M queries rather than revise the data warehouse, at least in the short term.

If a substantial gap between BI needs and the corporate data warehouse is allowed to persist and grow due to various factors, such as cost, project expediency, and available data warehouse skills, then Power BI datasets will become more complex to build and maintain. Dataset designers should regularly analyze and communicate the implications of datasets assuming greater levels of complexity.

However, if the required transformation logic is complex or extensive with multiple join operations, row filters, and data type changes, then the IT organization may choose to implement essential changes in the data warehouse to support the new dataset and future BI projects. For example, a staging table and a SQL stored procedure may be needed to support a revised nightly update process or the creation of an index may be needed to deliver improved query performance for a DirectQuery dataset.

Ideally, all required data transformation and shaping logic could be implemented in the source data warehouse and its ETL processes so that Power BI is exclusively used for analytics and visualization. However, in the reality of scarce IT resources and project delivery timelines, typically at least a portion of these issues must be handled through other means, such as SQL view objects or Power BI's M query functions.

A best practice is to implement data transformation operations within the data warehouse or source system. This minimizes the resources required to process an import mode dataset and, for DirectQuery datasets, can significantly improve query performance, as these operations would otherwise be executed during report queries. For many common data sources, such as Oracle and Teradata, M query expressions are translated into equivalent SQL statements (if possible) and these statements are passed back to the source system via a process called Query Folding. See Chapter 2, Connecting to Sources and Transforming Data with M, for more details on query folding.

As per the dataset planning architecture diagram, a layer of SQL views should serve as the source objects to datasets created with Power BI Desktop. By creating a SQL view for each dimension and fact table of the dataset, the data source owner or administrator is able to identify the views as dependencies of the source tables and is therefore less likely to implement changes that would impact the dataset without first consulting the BI team. Additionally, the SQL views improve the availability of the dataset, as modifications to the source tables will be much less likely to cause the refresh process to fail.

As a general rule, the BI team and IT organization will want to avoid the use of DAX for data transformation and shaping logic, such as DAX calculated tables and calculated columns. The primary reason for this is that it weakens the link between the dataset and the data source, as these expressions are processed entirely by the Power BI dataset after source queries have been executed. Additionally, the distribution of transformation logic across multiple layers of the solution (SQL, M, DAX) causes datasets to become less flexible and manageable. Moreover, tables and columns created via DAX do not benefit from the same compression algorithms applied to standard tables and columns and thus can represent both a waste of resources as well as a performance penalty for queries accessing these columns.

In the event that required data transformation logic cannot be implemented directly in the data warehouse or its ETL or extract-load-transform (ELT) process, a secondary alternative is to build this logic into the layer of SQL views supporting the Power BI dataset. For example, a SQL view for the product dimension could be created that joins the Product, Product Subcategory, and Product Category dimension tables, and this view could be accessed by the Power BI dataset. As a third option, M functions in the Power BI query expressions could be used to enhance or transform the data provided by the SQL views. See Chapter 2, Connecting to Sources and Transforming Data with M, for details on these functions and the Power BI data access layer generally.

Import versus DirectQuery

A subsequent but closely related step to dataset planning is choosing between the default Import mode or DirectQuery mode. In some projects, this is a simple decision as only one option is feasible or realistic given the known requirements while other projects entail significant analysis of the pros and cons of either design. For example, if a data source is considered slow or ill-equipped to handle a high volume of analytical queries then an import mode dataset is very likely the preferred option. Additionally, if multiple data sources are required for a dataset and they cannot be consolidated into a single DirectQuery data source then an import mode dataset is the only option. Likewise, if near real-time visibility to a data source is an essential business requirement then DirectQuery is the only option.

When DirectQuery is a feasible option or can be made a feasible option via minimal modifications, organizations may be attracted to the prospect of leveraging investments in high-performance database and data warehouse systems. However, the overhead costs and version control concerns of import mode can be reduced via Power BI features, such as the dataset refresh APIs discussed in Chapter 8, Managing Application Workspaces and Content, and the expected incremental data refresh feature for Power BI Premium capacities.

The following list of questions can help guide an import versus DirectQuery decision:

Is there a single data source for our dataset which Power BI supports as a DirectQuery source?
- For example, each fact and dimension table needed by the dataset is stored in a single data warehouse database, such as Oracle, Teradata, SQL Server, or Azure SQL Database.
- The following URL identifies the data sources supported for DirectQuery with Power BI, including sources which are currently only in beta: http://bit.ly/2AcMp25.
If DirectQuery is an option per question 1, is this source capable of supporting the analytical query workload of Power BI?
- For example, although Azure SQL Data Warehouse technically supports DirectQuery, it's not recommended to use Azure SQL Data Warehouse as a DirectQuery data source, given the limitations on the volume of concurrent queries supported and a lack of query plan caching.
- In many other scenarios, the data source may not be optimized for analytical queries, such as with star schema designs and indexes which target common BI/reporting queries. Additionally, if the database is utilized for online transaction processing (OLTP) workloads and/or other BI/analytical tools, then it's necessary to evaluate any potential impact to these applications and the availability of resources.
Is an import mode dataset feasible, given the size of the dataset and any requirements for near real-time visibility to the data source?
- Currently Power BI Premium supports import mode datasets up to 10 GB in size and incremental data refresh is not available. Therefore, massive datasets must either use a DirectQuery data source or a Live connection to an Analysis Services model.
- Additionally, Power BI Premium currently supports a maximum of 48 refreshes per day for import mode datasets. Therefore, if there's a need to view data source data for the last several minutes or seconds, an import mode dataset is not feasible.
If the DirectQuery source is capable of supporting a Power BI workload as per question 2, is the DirectQuery connection more valuable than the additional performance and flexibility provided via the import mode?
- In other words, if an import mode dataset is feasible, as per question 3, then an organization should evaluate the trade-offs of the two modes. For example, since an import mode dataset will be hosted in the Power BI service and in a compressed and columnar in-memory data store, it will likely provide a performance advantage. This is particularly the case if the DirectQuery source is hosted on-premises and thus queries from the Power BI cloud service must pass through the On-premises data gateway reviewed in Chapter 9, Managing the On-Premises Data Gateway.
- Additionally, any future data sources and most future data transformations will need to be integrated into the DirectQuery source. With an import mode dataset, the scheduled import process can include many data transformations and potentially other data sources without negatively impacting query performance.

For organizations that have invested in powerful data source systems for BI workloads, there's a strong motivation to leverage this system via DirectQuery. A business intelligence team or architect will be adverse to copying data into another data store and thus creating both another data movement and a source of reporting that must be supported. As a result of this scenario, Microsoft is actively investing in improvements to DirectQuery datasets for Power BI and Analysis Services models in DirectQuery mode. These investments are expected to reduce the gap in query performance between DirectQuery and the import mode. Additionally, a hybrid dataset mode may be released that allows teams to isolate tables or even partitions of tables between DirectQuery and Import storage options.

However, Microsoft is also in the process of expanding support for large import mode Power BI datasets hosted in Power BI Premium capacity. For example, in the near future a dataset much larger than 10 GB could be incrementally refreshed to only update or load the most recent data. Additional details on the capabilities provided by Power BI Premium, potential future enhancements, and the implications for Power BI deployments are included in Chapter 13, Scaling with Premium and Analysis Services.

Import mode

An import mode dataset can include multiple data sources, such as SQL Server, Oracle, and an Excel file. Since a snapshot of the source data is loaded into the Power BI cloud service, in addition to its in-memory columnar compressed structure, query performance is usually good for most scenarios. Another important advantage of import mode datasets is the ability to implement data transformations without negatively impacting query performance. Unlike DirectQuery datasets, the operations of data source SQL views and the M queries of Import datasets are executed during the scheduled data refresh process. The Query Design per dataset mode section of Chapter 2, Connecting to Sources and Transforming Data with M, discusses this issue in greater detail.

Given the performance advantage of the in-memory mode relative to DirectQuery, the ability to integrate multiple data sources, and the relatively few use cases where real-time visibility is required, most Power BI datasets are designed using the import mode.

DirectQuery mode

A DirectQuery dataset is limited to a single data source and serves as merely a thin semantic layer or interface to simplify the report development and data exploration experience. DirectQuery datasets translate report queries into compatible queries for the data source and leverage the data source for query processing, thus eliminating the need to store and refresh an additional copy of the source data.

A common use case of Power BI and SSAS Tabular DirectQuery datasets is to provide reporting on top of relatively small databases associated with OLTP applications. For example, if SQL Server 2016 or later is used as the relational database for an OLTP application, nonclustered columnstore indexes can be applied to several tables needed for analytics. Since nonclustered indexes are updateable in SQL Server 2016, the database engine can continue to utilize existing indexes to process OLTP transactions, such as a clustered index on a primary key column while the nonclustered columnstore index will be used to deliver performance for the analytical queries from Power BI. The business value of near real-time access to the application can be further enhanced with Power BI features, such as data-driven alerts and notifications.