Advanced – introducing window functions
A window function is a SQL function that operates on a set of rows, called a window of rows.
These functions are also called analytic functions, as they are mostly used in analytic queries.
A window is a set of correlated rows, containing any amount from 0 to all the rows resulting from a query.
Generally, the windows divide the resulting rows into disjointed partitions.
Some window functions can also operate on a window frame, which is a dynamic subset of the rows defined by the window, based on the current row.
You can define a cumulative window frame, taking the rows of the window before or after the current row, or a sliding window frame, going from some rows before the current row to some rows after the current row.
Window functions have a lot in common with aggregated functions, the ones that use GROUP BY
, as they both divide the results of a query into groups, that is, partitions, and then calculate the function on the rows in each group.
The big difference is that with GROUP BY
, you get only one row per group, while with window functions, all the rows are returned, with the function calculated on the desired subset of rows.
Let’s show some examples to make the concept more real.
Assume we have a table with orders, and we would like to know how much each order contributes to the total by day or month. We can calculate the percentage with the following formula:
monthly_pct = order_total / monthly_total * 100
Using group by, you could calculate the order total for each day or month, in separate queries, and then you would need to join those results with the original table to be able to apply the formula.
Window functions are here to help, as you can calculate the daily or monthly measures in a single step by defining a window that puts the rows together in different ways, as in the following query:
SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERDATE, O_TOTALPRICE, avg(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_avg, sum(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_total, sum(O_TOTALPRICE) over(partition by DATE_TRUNC(MONTH, O_ORDERDATE)) as monthly_total, O_TOTALPRICE / daily_avg * 100 as avg_pct, O_TOTALPRICE / daily_total * 100 as day_pct, O_TOTALPRICE / monthly_total * 100 as month_pct FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS" QUALIFY row_number() over(partition by O_ORDERDATE order by O_TOTALPRICE DESC) <= 5 order by O_ORDERDATE, O_TOTALPRICE desc;
This example uses window functions to calculate totals and averages over different periods and also to limit the returned rows to the five biggest orders for each day.
This query is pretty efficient, executing in a few seconds even after processing million of rows. By changing the schema from TPCH_SF1
to TPCH_SF10
, TPCH_SF100
, or TPCH_SF1000
, you can try out bigger tables in the Snowflake sample data, to see how it works with bigger datasets.
Window definition
A window is defined using the OVER
clause, which specifies how to partition and order the rows of the incoming data to apply the window function.
To use a window function, you use this syntax:
<function_name> ([<arguments>]) OVER ( [PARTITION BY <part_expr>] [ORDER BY <order_expr>])
The presence of the OVER
clause specifies that the function is applied to the windows defined inside the OVER
clause. Of course, the function can also take its own arguments.
The PARTITION BY
clause defines how the rows are partitioned and <part_expr>
is a comma-separated list of columns or expressions calculated on the query data, in a similar way to what we have seen for the group by expressions and shown in the previous example.
The ORDER BY
clause specifies the ordering of the rows to be used to calculate the function and it is not visible outside of the window function. The order by clause and <order_expr>
work pretty much as they do in the SELECT
statement to order the results of the query.
At the syntax level, both the PARTITION BY
and ORDER BY
clauses are optional, even if some functions require or forbid one of the two. Look at each function’s documentation for more details.
The ORDER BY
clause has an effect on the order-sensitive window function, which comes in two types:
- Rank-related functions: These functions calculate values based on the rank of the row, which is the order of the row in the window. They are used to make explicit, that is, crystallized into a value that can be used later, one of many possible orderings of a row in a window.
One example is the ordering of teams in a championship based on points or medals in a sport: you can have different styles of ordering based on what is decided when you have the same value.
As an example, if two teams (A and B) have the same score, after another team (C) with an higher score, you can say that A and B are both second or you can say that one is second and the other one third.
- Window frame functions: These functions work on a window frame, which is a dynamic subset of rows of a window. As we saw previously, they can use cumulative or sliding window frames. The next section will delve into these.
Window frame definition
Some functions can use a dynamic window frame instead of a fixed window.
You specify the window frame inside the OVER
clause, after the ORDER BY
clause of the window:
<function_name> ([<arguments>]) OVER ( [PARTITION BY <part_expr>] ORDER BY <order_expr> <cumulative_frame_def> | <sliding_frame_def> )
A cumulative window frame is a subset of the window that is either growing or shrinking based on the current row inside the window. The frame contains the rows of the window either before or after the current row. The syntax of the cumulative frame definition is the two following lines:
{ROWS | RANGE} BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | {ROWS | RANGE} BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
The first expression takes all the rows in the partition or all the rows in the same order range up to the current one, while the second takes from the current row to the end of the partition or range. A better description of ROWS
and RANGE
follows.
A sliding window frame is a subset of rows that extends from a number of N
rows before the current row to a number of M
rows after the current row. One or both sides can be fixed to the start or end of the partition.
The following are the three possible forms of the syntax:
ROWS BETWEEN <N> { PRECEDING | FOLLOWING } AND <M> { PRECEDING | FOLLOWING } | ROWS BETWEEN UNBOUNDED PRECEDING AND <M> { PRECEDING | FOLLOWING } | ROWS BETWEEN <N> { PRECEDING | FOLLOWING } AND UNBOUNDED FOLLOWING
The first syntax is when you want to create a frame that extends from N
rows before to M
rows after the current rows, while the other two are the cases when one of the two extremities is fixed to be the start or end of the window.
ROWS/RANGE in a cumulative window definition
When specifying a cumulative window frame, we can use either ROWS
or RANGE
, but what is the difference?
ROWS
: Indicates that the fixed side of the frame extends to the limit of the partition: from the start of the partition to the current row or from the current row to the end of the partition.RANGE
: Indicates that the fixed side of the frame extends to the previous/following rows that have the same value for theORDER BY
value. The frame operates on the preceding/following rows that are in the same position in the ordering. As an example, you can think of stepping through many teams that have the same number of points in a championship list.
Important note
Providing extensive information on window functions is beyond the scope of this introduction to SQL used in analytical operations. It is also one area where SQL is not always the same in all databases. We suggest searching for “window functions” in the documentation of your database.