Combining and Merging Queries
The full power of Power BI's querying capabilities is in the integration of distinct queries representing different data sources via its merge and append transformations. Retrieval processes that consolidate files from multiple network locations or integrate data from multiple data sources can be developed efficiently and securely. Additionally, the same join types and data transformation patterns SQL and ETL developers are familiar with can be achieved with the M language. This recipe provides examples of combining sources into a single query and leveraging the table join functions of M to support common transformation scenarios.
Getting ready
To follow along with this recipe, you can use the Merge Queries and Append Queries icons on the Home tab of the Power Query Editor to generate the join expressions used in this recipe. However, as joining queries is fundamental to the retrieval process, it is recommended to learn how to use the Table.Join
, Table.NestedJoin
, and Table.Combine
functions.
To get ready for this recipe, import the DimCustomer
and FactCallCenter
tables from the AdventureWorksDW2019 database by doing the following:
- Open Power BI Desktop and choose Transform data from the ribbon of the Home tab to open the Power Query Editor.
- Create an Import mode data source query called
AdWorksDW
. This query should be similar to the following:let Source = Sql.Database("localhost\MSSQLSERVERDEV", "AdventureWorksDW2019") in Source
- Isolate this query in a query group called Data Sources.
- Right-click
AdWorksDW
and choose Reference, select theDimEmployee
table in the data preview area, and rename this queryDimEmployee
. - Repeat step 4 but choose the FactInternetSales table and name this query Sales2011. Filter the OrderDate column to be between January 1, 2011 and December 31, 2011.
Let Source = AdWorksDW, dbo_FactInternetSales = Source{[Schema="dbo",Item="FactInternetSales"]}[Data], #"Filtered Rows" = Table.SelectRows( dbo_FactInternetSales, each [OrderDate] >= #datetime(2011, 1, 1, 0, 0, 0) and [OrderDate] <= #datetime(2011, 12, 31, 0, 0, 0)) in #"Filtered Rows"
- Right-click the Sales2011 query and choose Duplicate. Name this query Sales2012. Edit the
Table.SelectRows
expression to filter dates between January 1, 2012 and December 31, 2012. - Repeat step 6, naming this new query Sales2013 and filtering for dates between January 1, 2013 and December 31, 2013.
For additional details on performing these steps, see the Managing Queries and Data Sources recipe in this chapter.
How to Combine Queries
The goal of this example is to produce an integrated table based on three "Sales" queries. While in this example the three queries come from the same data source, it is important to realize that the three queries could point to completely different data sources, such as three text files or even a text file, an Excel spreadsheet, and a SQL database. What is important is that the three queries have the same column names and number of columns. To implement this recipe, perform the following steps:
- Move the Sales2011, Sales2012, and Sales2013 queries to a new query group called Staging Queries.
- Disable the load for the Sales2011, Sales2012, and Sales2013 queries.
- Create a new Blank query in the Other Queries group and call this query FactInternetSales2011to2012.
- Open the FactInternetSales2011to2012 query in Advanced Editor.
- Use the
Table.Combine
(or Append Queries feature in the ribbon) function to return a single table based on the rows of the Sales2011, Sales2012, and Sales2013 queries.Let Source = Table.Combine( {Sales2011, Sales2012, Sales2013} ) in Source
- Move the DimEmployees query to the Staging Queries group and disable loading.
- Duplicate the DimEmployees query and call this new query Managers.
- Disable loading for the Managers query.
- Open the Managers query in the Advanced Editor.
- Add a Manager Name column using the
Table.AddColumn
function.ManagerName = Table.AddColumn( dbo_DimEmployee, "Manager Name", each [FirstName] & " " & [LastName] )
- Select only the EmployeeKey, Manager Name, and Title columns using the
Table
.SelectColumns
function.SelectCols = Table.SelectColumns( ManagerName, { "EmployeeKey", "Manager Name", "Title" } )
- Create a new Blank query in the Other Queries group and call this query Employees.
- Open the Employees query in Advanced Editor.
- Join the
Managers
query andDimEmployee
query using theTable.NestedJoin
function or the Merge Queries feature in the ribbon.Let Source = Table.NestedJoin( DimEmployee, "ParentEmployeeKey", Managers, "EmployeeKey", "ManagerColumn", JoinKind.LeftOuter ) in Source
- Use the
Table
.ExpandTableColumn
function to add theManager Name
andManager Title
columns.ManagerColumns = Table.ExpandTableColumn( Source, "ManagerColumn", { "Manager Name", "Title" }, { "Manager Name", "Manager Title" } )
- Add an Employee Name column.
EmployeeName = Table.AddColumn( ManagerColumns, "Employee Name", each [FirstName] & " " & [LastName] )
- Rename the Title column to Employee Title.
RenameTitle = Table.RenameColumns( EmployeeName, { "Title", "Employee Title" } )
- Select the
EmployeeKey
,Employee Name
,Employee Title
,Manager Name
, andManager Title
columns.SelectCols = Table.SelectColumns( RenameTitle, { "EmployeeKey", "Employee Name", "Employee Title", "Manager Name", "Manager Title" } )
How it works
For the FactInternetSales2011to2013 query, only a single expression is required using the Table.Combine
function. No other expressions are necessary in this example given that the staging queries have identical column names and the same number of columns. The Table.Combine
function performs an append operation and does not remove duplicate rows similar to a SQL UNION statement.
Any columns which are unique to one of the input tables in a Table.Combine
function will be added to the result set with null values for the rows from the other tables. Depending on the scenario, the developer could apply the Table.Distinct
function to avoid any duplicate rows from reaching the data model.
The Employees query references the DimEmployees query as the left table in a Table.NestedJoin
function, and is joined to the Managers
query via a left outer join. The left join is required to retain all employee rows in this scenario, as the DimEmployees
table includes one employee that does not have a parent employee key, the Chief Executive Officer
.
The join is performed on the ParentEmployeeKey
column in the DimEmployees query to the EmployeeKey
column in the Managers query. After this step, all of the rows from the matching Managers table are stored within the row as a Table
object in the column ManagerColumn. When expanding the ManagerColumn column using the Table.ExpandTableColumn
function, the Manager Name column can retain the same name, but the Title column is renamed to Manager Title in order to avoid conflicting with the Title column in the DimEmployees query.
In implementing the table joins, you can choose to use the Table.Join
and Table.NestedJoin
functions. All six join types—inner, left outer, right outer, full outer, left anti, and right anti—are supported by both functions. The Table.NestedJoin
function enters the results of the join (the right or second table) into a new column of table values and will use local resources to execute the join operation, unless the Table.ExpandTableColumn
function is used to replace this new column with columns from the right table. A left outer join type is the default if the JoinKind
parameter is not specified. For performance reasons, Table.NestedJoin
should not be used without a Table.ExpandTableColumn
function removing the column of tables.
Conversely, the Table.Join
function automatically expands the left table with the columns from the right table input (a flat join) and defaults to an inner join if the JoinKind
parameter is not specified. The Table.Join
function gets folded to the source without any additional functions but requires that there are no matching column names between the joined tables for a JoinKind
other than inner join. For inner joins, the matching column names from both tables must be specified in the join key parameters. A Table.SelectColumns
function is required to exclude any columns from the right table added with the join.
Whether implemented via Table.NestedJoin
or Table.Join
, developers should look to use inner joins if the source tables have referential integrity, such as with foreign key constraints and whether this meets requirements. For joins against larger tables, developers should confirm that query folding is occurring and can evaluate the different query plans generated by alternative retrieval designs in terms of performance.
Note that the two rows for Rob Walters are due to a Slowly Changing Dimension (SCD) Type 2 process applied in the source database. For more information on SCDs, refer to this Wikipedia article: https://bit.ly/3yIQeI5.
There's more...
Rather than creating separate lookup/join staging queries, it is possible to consolidate these expressions into a single let…in
M expression. For example, the following single query returns the exact same results as the Sales2011, Sales2012, Sales2013, and FactInternetSales2011to2013 queries:
let
Source = AdWorksDW,
Sales = Source{[Sche"a=""bo",It"m="FactInternetSa"es"]}[Data],
Sales2011Rows =
Table.SelectRows(
Sales, each
[OrderDate] >= #datetime(2011, 1, 1, 0, 0, 0) and
[OrderDate] <= #datetime(2011, 12, 31, 0, 0, 0)
),
Sales2012Rows =
Table.SelectRows(
Sales, each
[OrderDate] >= #datetime(2012, 1, 1, 0, 0, 0) and
[OrderDate] <= #datetime(2012, 12, 31, 0, 0, 0)
),
Sales2013Rows =
Table.SelectRows(
Sales, each
[OrderDate] >= #datetime(2013, 1, 1, 0, 0, 0) and
[OrderDate] <= #datetime(2013, 12, 31, 0, 0, 0)
),
Append = Table.Combine( {Sales2011Rows, Sales2012Rows, Sales2013Rows} )
in
Append
Inline query approaches are helpful in limiting the volume of queries, but you lose the management benefits provided by group folders and the Query Dependencies view. The graphical support makes it easier to explain and quickly troubleshoot a data retrieval process over a single but complex M expression. Staging queries are recommended for all but the most trivial projects and retrieval processes. Staging queries should generally never be loaded to the data model, as staging tables could both confuse the user and would require the data model to use additional resources to process and store the additional data.
Similarly, merge queries can also be combined into a single query. The following table breaks down the six different join types that can be specified in both the Table.NestedJoin
and Table.Join
functions. Both the Parameter and Parameter Value can be used, though the recipes in this book use Parameter as this makes the expressions easier to follow.
Join type |
Parameter |
Parameter value |
Inner |
|
0 |
Left Outer |
|
1 |
Right Outer |
|
2 |
Full Outer |
|
3 |
Left Anti |
|
4 |
Right Anti |
|
5 |
Table 2.3: Power Query (M) join types, parameters, and parameter values
One final note is that, for data source files with the same structure stored in the same network directory folder, Power BI offers the Combine Binaries transformation, which can be used with text, CSV, Excel, JSON, and other file formats. This feature can be used when creating a Folder query. The Combine Binaries feature automatically creates an example query and a function linked to this query, such that any required modification to the source files can be applied to all files, and the source location of the files can be easily revised.
See also
Table.Combine
: http://bit.ly/3c6L2o0Table.NestedJoin
: http://bit.ly/30ZO3jZTable.Join
: http://bit.ly/3lxMRh7Table.ExpandTableColumn
: http://bit.ly/3eY6u0zTable.AddColumn
: http://bit.ly/3vGJZ6bTable.SelectColumns
: http://bit.ly/38Qk7LtTable.RenameColumns
: http://bit.ly/3rTVfd4- M table function reference: http://bit.ly/2oj0k0I
- Combining binaries in Power BI Desktop: http://bit.ly/2oL2nM4