Optimizing MDX queries using the NonEmpty() function
The NonEmpty()
function is a very powerful MDX function. It is primarily used to improve query performance by reducing sets before the result is returned.
Both Customer
and Date
dimensions are relatively large in the Adventure Works DW 2016 database. Putting the cross product of these two dimensions on the query axis can take a long time. In this recipe, we will show how the NonEmpty()
function can be used on the Customer
and Date
dimensions to improve the query performance.
Getting ready
Start a new query in SSMS and make sure that you are working on the Adventure Works DW 2016 database. Then write the following query and execute it:
SELECT { [Measures].[Internet Sales Amount] } ON 0, NON EMPTY Filter( { [Customer].[Customer].[Customer].MEMBERS } * { [Date].[Date].[Date].MEMBERS }, [Measures].[Internet Sales Amount] > 1000 ) ON 1 FROM [Adventure Works]
The query shows the sales per customer and dates of their purchases, and isolates only those combinations where the purchase was over 1,000 USD.
On a typical server, it will take more than a minute before the query will return the results.
Now let us see how to improve the execution time by using the NonEmpty()
function.
How to do it...
Follow these steps to improve the query performance by adding the NonEmpty()
function:
- Wrap
NonEmpty()
function around the cross join of customers and dates so that it becomes the first argument of that function. - Use the measure on columns as the second argument of that function.
- This is what the MDX query should look like:
SELECT { [Measures].[Internet Sales Amount] } ON 0, NON EMPTY Filter( NonEmpty( { [Customer].[Customer].[Customer].MEMBERS } * { [Date].[Date].[Date].MEMBERS }, { [Measures].[Internet Sales Amount] } ), [Measures].[Internet Sales Amount] > 1000 ) ON 1 FROM [Adventure Works]
- Execute that query and observe the results as well as the time required for execution. The query returned the same results, only much faster, right?
How it works...
Both the Customer
and Date
dimensions are medium-sized dimensions. The cross product of these two dimensions contains several million combinations. We know that, typically, the cube space is sparse; therefore, many of these combinations are indeed empty. The Filter()
operation is not optimized to work in block mode, which means a lot of calculations will have to be performed by the engine to evaluate the set on rows, whether the combinations are empty or not.
This is because the Filter()
function needs to iterate over the complete set of data in every cell in order to isolate a single cell. For this reason, the Filter()
function can be slow when operating on large dimensions or cross–join result of even medium-sized dimensions.
Tip
The Filter()
operation is not optimized to work in block mode. It filters a specified set based on a search condition by iterating through each tuple in the specified set. It's a cell-by-cell operation and can be very slow when operating on large dimensions. For a good explanation of the block mode versus cell-by-cell mode, please see The pluses and minuses of named sets section of
Chapter 5, Navigation.
Fortunately, the NonEmpty()
function exists. This function can be used to reduce any set, especially multidimensional sets that are the result of a crossjoin operation.
The NonEmpty()
function removes the empty combinations of the two sets before the engine starts to evaluate the sets on rows. A reduced set has fewer cells to be calculated, and therefore the query runs much faster.
There's more...
Regardless of the benefits that were shown in this recipe, the NonEmpty()
function should be used with caution. Here are some good practices regarding the NonEmpty()
function:
- Use it with sets, such as named sets and axes.
- Use it in the functions which are not optimized to work in block mode, such as with the
Filter()
function. - Avoid using it in aggregate functions such as
Sum()
. TheSum()
function and other aggregate functions are optimized to run in block mode. If you pass the data through theNonEmpty()
theSum()
function, which break it into many small non-empty chunks, you will turn this optimization off and those functions will run in a much slower cell-by-cell mode. - Avoid using it in other MDX set functions that are optimized to work in block mode. The use of the
NonEmpty()
function inside optimized functions will prevent them from evaluating the set in block mode. This is because the set will not be compact once it passes theNonEmpty()
function. The function will break it into many small non-empty chunks, and each of these chunks will have to be evaluated separately. This will inevitably increase the duration of the query. In such cases, it is better to leave the original set intact, no matter its size. The engine will know how to run over it in optimized mode.
NonEmpty() versus NON EMPTY
Both the NonEmpty()
function and the NON EMPTY
keyword can reduce sets, but they do it in a different way.
The NON EMPTY
keyword removes empty rows, columns, or both, depending on the axis on which that keyword is used in the query. Therefore, the NON EMPTY
operator tries to push the evaluation of cells to an early stage whenever possible. This way, the set on axis is already reduced and the final result is faster.
Take a look at the initial query in this recipe, remove the Filter()
function, run the query, and notice how quickly the results come, although the multi–dimensional set again counts millions of tuples. The trick is that the NON EMPTY
operator uses the set on the opposite axis, the columns, to reduce the set on rows. Therefore, it can be said that NON EMPTY
is highly dependent on members on axes and their values in columns and rows.
Contrary to the NON EMPTY
operator found only on axes, the NonEmpty()
function can be used anywhere in the query.
The NonEmpty()
function removes all the members from its first set, where the value of one or more measures in the second set is empty. If no measure is specified, the function is evaluated in the context of the current member.
In other words, the NonEmpty()
function is highly dependent on members in the second set, the slicer, or the current coordinate, in general.
Common mistakes and useful tips
If a second set in the NonEmpty()
function is not provided, the expression is evaluated in the context of the current measure at the moment of evaluation, and current members of attribute hierarchies, also at the time of evaluation. In other words, if you are defining a calculated measure and you forget to include a measure in the second set, the expression is evaluated for that same measure which leads to null, a default initial value of every measure. If you are simply evaluating the set on the axis, it will be evaluated in the context of the current measure, the default measure in the cube, or the one provided in the slicer. Again, this is perhaps not something you expected. In order to prevent these problems, always include a measure in the second set.
The NonEmpty()
function reduces sets, just like a few other functions, namely Filter()
and Existing()
. But what's special about NonEmpty()
function is that it reduces sets extremely efficiently and quickly. Because of that, there are some rules about where to position NonEmpty()
function in calculations made by the composition of MDX functions (one function wrapping the other). If we are trying to detect multi-select, that is, multiple members in the slicer, NonEmpty()
function should go inside, with the EXISTING
function/keyword outside. The reason is that although they both shrink sets efficiently, the NonEmpty()
function works great if the set is intact. The EXISTING
keyword is not affected by the order of members or compactness of the set. Therefore, the NonEmpty()
function should be applied earlier.
You may get System.OutOfMemory
errors if you use the CrossJoin()
operation on many large hierarchies because the cross join generates a Cartesian product of those hierarchies. In that case, consider using the NonEmpty()
function to reduce the space to a smaller subcube. Also, don't forget to group the hierarchies by their dimension inside the cross join.