In this section, you will learn how to query existing data using SQL and the Snowflake example database.
In this chapter, and more generally in this book, you will get familiar with Snowflake syntax, which is modern and very standard. We will also use some proprietary extensions that make your work easier.
Snowflake query syntax
Snowflake supports querying data with the standard SELECT
statement, which has the following basic syntax:
WITH …
SELECT …
FROM …
JOIN …
WHERE …
GROUP BY …
HAVING …
QUALIFY …
ORDER BY …
LIMIT …
The only mandatory part is select, so SELECT 1
is a valid query that just returns the value 1.
If you are familiar with SQL from other database systems, you will wonder what the QUALIFY
clause is. It is an optional SQL clause that is very well suited to the analytical kind of work that Snowflake is used for and that not all database engines implement. It is described later in this section.
We often use the terms query, command, and statement interchangeably when referring to some piece of SQL that you can execute.
Properly speaking, a command is a generic command such as SELECT
or CREATE <object>
, while a statement is one specific and complete instance of a command that can be run, such as SELECT 1
or CREATE TABLE
my_table …;
.
The term query should really only refer to SELECT
statements, as SELECT
statements are used to query data from the database, but query is often used with any statement that has to do with data.
You will also often hear the term clause used, such as the FROM
clause or GROUP BY
clause. Informally, you can think about it as a piece of a statement that follows the syntax and rules of that specific keyword.
The WITH clause
The WITH
clause is optional and can only precede the SELECT
command to define one or more Common Table Expressions (CTEs). A CTE associates a name with the results of another SELECT
statement, which can be used later in the main SELECT
statement as any other table-like object.
Defining a CTE is useful for the following:
- Clarity: You can provide an informative name to a piece of SQL
- Reuse, maintenance, and efficiency: You can define a supporting query, whose results you might use more than once in the main query, and the database engine will execute the supporting query once
- Creating recursive queries: Defining a CTE that queries itself is the only way to use recursion in SQL
- Break a complex transformation into smaller, simpler steps that are easier to code and understand and limit the scope of future maintenance
The simplified syntax is as follows:
WITH [RECURSIVE]
<cte1_name> AS (SELECT …)
[, <cte2_name> AS (SELECT …)]
SELECT …
We will check out some examples here:
- Check whether we still have high-priority orders pending:
WITH
high_prio_orders as (
SELECT *
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE O_ORDERPRIORITY IN ('1-URGENT', '2-HIGH')
)
SELECT count(*)
FROM high_prio_orders
WHERE O_ORDERDATE < '1998-01-01'
and O_ORDERSTATUS = 'O';
Here, you can see that the first expression encapsulates the business definition of high-priority order to be any order with the priority set to urgent or high priority. Without the CTE, you have to mix the business definition and other filtering logic in the WITH
clause. Then, it would be unclear whether the status is part of the definition or is just a filter that we are applying now.
- Calculate some metrics for customers in the auto industry:
WITH
auto_customer_key as (
SELECT C_CUSTKEY
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
WHERE C_MKTSEGMENT = 'AUTOMOBILE'
),
orders_by_auto_customer as (
SELECT O_ORDERKEY
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
WHERE O_CUSTKEY in (SELECT * FROM auto_customer_key)
),
metrics as (
SELECT 'customers' as metric, count(*) as value
FROM auto_customer
UNION ALL
SELECT 'orders by these customers', count(*)
FROM orders_by_auto_customer
)
SELECT * FROM metrics;
In this example, the CTEs and the final query remain short and very simple. You can start to appreciate that the clear labeling of intentions and the simple pieces of SQL make the full query easy to understand for anyone.
The SELECT clause
The SELECT
command can appear in two possible forms: as the central clause of a SELECT
statement or used as a clause in other statements. In both cases, it defines the set of columns and calculated values returned by the statement.
The simplified syntax of a SELECT
clause in Snowflake is a list of column definitions separated by commas, with the optional distinct
keyword to omit duplicates:
SELECT [DISTINCT]
<column_definition_1> [, <column_definition_2> …]
The SELECT
clause is used to provide the list of columns that the query will return.
For each column definition that is provided, it is also possible to provide a column alias, which will be used to identify that column definition in the results.
When two column definitions have the same name, an alias must be provided for at least one of the two to avoid a name clash in the results.
A column definition can be one of these four expressions:
<column_definition> =
[object.*] -- all columns in the object
|[object.col_name] -- the named column in the object
|[object.$n] -- the n-th column in the object
|[<expression>] -- the value of the expression
AS <col_alias>
Let’s describe in detail these four expressions:
- The star symbol,
*
, indicates all possible columns; it means “all columns from all tables” if it is applied without an alias or “all the columns from the aliased object” if applied with an alias.- Star without an alias will return all columns from both tables:
SELECT * FROM table_1, table_2
- Star with an alias (or table name) will return all columns from the object with the alias. In this case, it will return all columns from
table_1
, as it is applied to its alias:SELECT t1.* FROM table_1 as t1, table_2 as t2
- Snowflake has introduced two powerful extensions to the Star Syntax:
EXCLUDE
to remove some columns from being returned and RENAME
to rename a column while selecting it, as shown in this example:SELECT * EXCLUDE (c3, c2) RENAME (c4 as cx, c5 as cy)
FROM table_1
- A column name, optionally with an alias indicating the object the column comes from and/or an alias to use for the column in the results:
SELECT
O_ORDERKEY,
ord.O_CUSTKEY,
cust.C_NAME as CUSTOMER_NAME
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS as ord
JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER as cust
ON cust.C_CUSTKEY = ord.O_CUSTKEY;
Note that here, the designers of the TPCH database have gone a long way to keep column names unique between all tables, by adding the table initials as a prefix to column names. This is a pretty old style as it makes it more difficult to recognize fields that contain the same values as C_CUSTKEY
and O_CUSTKEY
.
Looking at our example query, we can see the following:
- Object aliases are optional if the name of the column is unique in the tables referenced in the query, as it is for all columns in the TPCH database.
- Aliases can in any case be used for clarity, like with
cust.C_NAME
.
- Object aliases are mandatory when referencing a column name that appears in more than one referenced table. This would have been the case if both fields had been just
CUSTKEY
, without the C_
and O_
table prefixes.
- Lastly, a column alias can always be used, even just to rename a column with an initial name that we do not like or is not clear, as in the case of a name that would not be clear, so we rename it to
CUSTOMER_NAME
.
- A column number after a dollar sign, such as $1 or $3.
This allows us to reference columns that do not have a name, such as when reading out of a CSV file without a header. This is also useful, especially in scripts, to reference columns that we do not know the name of, but we know their position:
SELECT $1 as ORDER_KEY, $2 as CUST_KEY
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS;
In this case, we read from a table that has column names, and we can always use column numbers instead of column names. But this is rarely done when reading from a table or a view because by using the numbers, the result will depend on the order of the columns in the object, which can change over time if the object is recreated.
- An expression, like a mathematical expression or a function call, that evaluates to some value for each row:
SELECT
P_PARTKEY
, UPPER(P_NAME) as P_NAME
, P_RETAILPRICE
, P_RETAILPRICE * 0.9 as P_DISCOUNTED_PRICE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."PART";
The preceding example shows two expressions: a mathematical expression that multiplies a column by a constant to produce a new column and a function call that converts the p_name
column to uppercase, keeping the same column name.
Important note
The ability highlighted here, to write arbitrary expressions and assign a name to the result, is at the core of how SQL is used to transform data and is one of the most common things that we will do with dbt.
In this section, we have seen that the SELECT
clause allows us four great abilities:
- To decide what data from the source tables to keep in the result of a query
- To calculate new data from source data using arbitrary complex expressions
- To provide a new name for the data that we want in the query result
- To keep only one copy of each distinct row in the result, using the
DISTINCT
keyword
The FROM clause
The FROM
clause introduces the table objects used in a SQL statement as the source of data or target of the command.
In the case of a SELECT
statement, the FROM
clause can list none, one, or more table objects, each with its own alias. The rows in the result of the SELECT
statement will potentially have all the columns from all the tables referenced in the FROM
clause.
The simplified syntax looks like this:
SELECT …
FROM [tableObject1 [AS alias1]]
[, tableObject2 [AS alias2] …]
Table objects, in the context of a SELECT
statement, are a combination of the following:
- Tables and views: Tables and views are the main data objects SQL works with, so most of the time, this is what you find in the
FROM
clause.
- Table functions: Table functions are a category of functions that return a set of rows, with one or more columns. They can be predefined functions or user-defined ones.
- A
VALUES
clause: This clause allows us to build a set of rows, using constant values. This allows us to create an inline table that can be suitable as a reference table, mapping table, or test input.
- A few other objects that can be read from, such as a
LATERAL
subquery or a staged file.
When no table is listed, then the columns in the SELECT
clause must use constants and global objects such as the current_date
function, as in the following example:
SELECT 1 + 1 as sum, current_date as today;
When we use the VALUES
clause, we can define an inline table, like in this example:
SELECT * FROM ( VALUES
('IT', 'ITA', 'Italy')
,('US', 'USA', 'United States of America')
,('SF', 'FIN', 'Finland (Suomi)')
as inline_table (code_2, code_3, country_name)
);
When the FROM
clause lists more than one table object, the result is the Cartesian product of the elements in these tables. To avoid a Cartesian explosion, it is important to include a WHERE
clause that restricts the combinations, keeping only the ones we want.
Important note
The Cartesian product is the result of pairing the rows of all the objects in the product in all possible ways. This means that the number of rows returned by the Cartesian product is the product of the number of rows in each object in the product.
The expression Cartesian explosion is often used to refer to the fact that the number of rows returned by the Cartesian product increases very quickly and can easily create performance issues.
The following query, thanks to the WHERE
clause, will generate a more meaningful and compact set of rows, ideally equal to the number of rows that exist in the LINEITEM
table:
SELECT count(*)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."LINEITEM" as l
,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS" as o
,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" as c
,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."PART" as p
WHERE o.O_ORDERKEY = l.L_ORDERKEY
and c.C_CUSTKEY = o.O_CUSTKEY
and p.P_PARTKEY = l.L_PARTKEY
;
In the previous query, we have used a WHERE
clause to keep in the results only the rows that are really related. We achieve this by using equivalence constraints between the fields of the related tables.
The previous query generates a result of exactly 6,001,215 rows, which is the number of rows in the LINEITEM
table, instead of the astronomical number resulting from the multiplication of the number of rows: 6,001,215 x 1,500,000 x 150,000 x 200,000, which is pretty much 2,7 x 10^23.
I’ll let you calculate the exact result…
The consequence is that the full query processes 6 million rows and runs in a few seconds, while the query without the WHERE
clause will try to process 10^18 times more rows, taking very many hours or days to complete, in the process using up a lot of resources for nothing, as the resulting data provides nothing of interest.
The exact correspondence in the count derives from the fact that for each line item, there is only one order, for each order only one customer, and so on.
Providing the relevant WHERE
clause will help with reducing the result by many orders of magnitude even if the keys don’t match so precisely. We will deal with this topic in more detail in Chapter 3, where we talk about data modeling.
This example illustrates well the power of SQL and why it is important to express what we want correctly.
Tip
As a basic rule, never do an unrestricted Cartesian product unless you know what you are doing and you are fine with the exploded number of rows that you will generate.
The JOIN clause
JOIN
is a subclause of the FROM
clause and is used to describe in a more precise way how to combine the rows of two tables. JOIN
cannot exist outside of a FROM
clause.
The simplified syntax of JOIN
looks like this:
SELECT …
FROM tableObject1 AS to1
[<join type>] JOIN tableObject2 as to2 ON <condition_A>
[[<join type>] JOIN tableObject3 as to3 ON <condition_B>]
…
The condition in each join is a Boolean expression, often using columns from the base to1
table and the joined table. It does not need to be an equality check and can use columns from multiple tables or even constants.
It is normal to chain more than one JOIN
inside a FROM
clause to pick data from multiple tables at once, and you generally do not need to think too much about the order of the joins, but it is good to know that the order might matter. In the case of Snowflake, the order is from left to right (think of the statement written on one row).
We will look in more detail into the JOIN
clause in the next section, where we will introduce the different types of joins and some examples.
We have seen that we can combine data from multiple tables, listing them in the FROM
clause, and that we can use the WHERE
clause to restrict the resulting Cartesian product to the combined rows that we actually want to keep. This use of FROM
and WHERE
is equivalent to the INNER JOIN
construct, which we will see in the Combining data in SQL – the JOIN clause section.
The WHERE clause
The WHERE
clause specifies the conditions that the rows involved in the command need to match.
The rows for which the expression evaluates to true
are kept/processed.
In the context of a SELECT
statement, the WHERE
clause limits the result of the query to the subset of rows that verify the overall condition set by the WHERE
clause.
The simplified syntax is as follows:
SELECT …
WHERE <predicate>
<predicate>
can be as simple as the literal true
or false
or a complex expression, including logical operators and database functions that evaluate to a Boolean value (true
, false
, or NULL
) for every row that the query processes.
Tip
An expression that returns a Boolean value is called a predicate.
The following example returns only the rows where the order total is greater than 500,000:
SELECT O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE O_TOTALPRICE > 500000;
The following example does not return any row, but defines two columns with the same name and type as the columns in the source table and two with the name and type that we provided:
SELECT O_ORDERKEY,O_CUSTKEY, 1 as an_int, null::number as a_num
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE false;
This may be a good trick when you need to define some column names and types. Remember that you do not need a FROM
clause if you do not want to use columns from an existing table…
Important note
Be careful when NULL
is involved as it might not behave as you expect.
As an example, the NULL = NULL
expression evaluates to NULL
.
In a WHERE
clause, it means that the corresponding row is discarded.
We have seen in a previous example using the FROM
clause that the WHERE
clause can be used to specify some types of JOIN
. We suggest using the JOIN … ON …
syntax for joins and the WHERE
clause to filter rows. We will look at more examples in the Combining data in SQL - the JOIN clause section.
The GROUP BY clause
A GROUP BY
clause is used to calculate aggregate functions on groups of rows that produce the same value for the group by expression.
The simplified syntax is as follows:
SELECT …
GROUP BY groupExpr1 [, groupExpr2 …]
The group by expression can be one of the following:
- Column name: The result of the expression is the value of the column with the given name. All rows with the same value in the column are grouped.
The following query calculates the sum of all orders by each customer (customer key):
SELECT O_CUSTKEY, sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY O_CUSTKEY;
- Number: The value to group by is the value of the column in the given position in the result of the
SELECT
statement. The value is evaluated at the end of the statement, after applying any function or expression.
The following query is the same as the previous one, using the ordinal 1
instead of the O_CUSTKEY
column name to indicate the column to group by:
SELECT O_CUSTKEY, sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY 1;
- SQL expression: Any expression using any combination of data from the query. The value to group by is the result of the expression.
The following query calculates the total orders by year, using the year()
function, which returns the year from a date:
SELECT YEAR(O_ORDERDATE), sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE)
ORDER BY YEAR(O_ORDERDATE);
We have also added the ORDER BY
clause with the same expression as the GROUP BY
so that we get the output in a nice order. This is a pretty common pattern.
When more than one group by expression is provided, the rows are grouped together according to the values of all the group by expressions.
In the previous example, you could have used GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
to have the order total split by month (and year), like in the following:
SELECT YEAR(O_ORDERDATE),MONTH(O_ORDERDATE),sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
ORDER BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE);
The HAVING clause
The HAVING
clause filters the rows resulting from a GROUP BY
clause according to a predicate
:
SELECT …
GROUP BY …
HAVING <predicate>
The predicate is an expression that returns a Boolean value and references:
- Constants
- Expressions that appear in the
GROUP
BY
clause
- Aggregated functions that can be calculated according to the
GROUP
BY
clause
As an example, we could extend the example from the GROUP BY
topic to look at the months where we have less than 10,000 orders.
The following query does it:
SELECT YEAR(O_ORDERDATE), MONTH(O_ORDERDATE), sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
HAVING count(*) < 10000
ORDER BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE);
In this query, we have used the count()
aggregate function to count how many lines, one for each order, there are in one group.
Tip
The HAVING
clause is for GROUP BY
what the WHERE
clause is for the FROM
clause.
We will look at the third filtering clause, QUALIFY
, in the next topic.
The QUALIFY clause
The QUALIFY
clause filters on the results of the window functions, which is the third type of data calculation that we can have in a query after expressions and group by.
We will dedicate the last section of this chapter to window functions.
The QUALIFY
clause is an optional SQL clause that proves extremely useful in analytical workloads, so it is implemented in Snowflake.
To recap a bit of what we have seen, these are the three clauses that are used to filter data:
WHERE
applies to the data being read from the sources specified in the FROM
/ JOIN
clauses
HAVING
applies to the data that has been grouped by a GROUP
BY
clause
QUALIFY
applies to the data that has been calculated by a window function
One of the most common usages of the QUALIFY
clause together with the simple row_number()
window function is to defend against undesired duplicates or select one specific row with respect to other rows that represent different versions of the same object:
SELECT *
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."LINEITEM"
QUALIFY row_number()
over(partition by L_ORDERKEY, L_LINENUMBER
order by L_COMMITDATE desc ) = 1;
This query selects all fields from a table, just filtering out undesired rows for which the row_number
window function returns a value different from 1
.
The row_number
window function, assigns a progressive number from 1
onward to all the rows in the same window, following the row order, and then restarts from 1
with the next window.
Picking the rows where it is equal to 1 means keeping only the first row for each window.
The windows are defined by the over(…)
clause, which comprises a partition by
part that defines the windows, in a way similar to group by
, and an order by
that provides the ordering in the window. We will look in more detail at this in the section devoted to windows functions.
Tip
When using the QUALIFY
clause, as shown in the previous query, we avoid the need to create a column with the result of the row_number
window function and then use a subquery to filter on it. That is the normal way of doing this kind of filtering in a database that does not support the QUALIFY
clause. QUALIFY
is much simpler to read and use.
Now, let’s extend the order example from the previous section to select out the “good months,” when we have higher total sales than the average sales for the year they belong to:
WITH
monthly_totals as (
SELECT
YEAR(O_ORDERDATE) as year,
MONTH(O_ORDERDATE) as month,
sum(O_TOTALPRICE) as month_tot
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
)
SELECT year, month, month_tot
,avg(month_tot) over(partition by YEAR) as year_avg
FROM monthly_totals
QUALIFY month_tot > year_avg
ORDER BY YEAR, MONTH;
Here, we have put together a few of the clauses that we have seen so far, using a WITH
clause to define a CTE named monthly_totals
with our previous query to calculate monthly totals.
We have then defined a query that reads from the CTE and uses the avg
window function to calculate the average monthly sales for each month of the full year that a particular month belongs to. Then, we use the QUALIFY
clause to keep only the rows where the monthly total is greater than the yearly average.
Tip
The previous example illustrates why the window functions are also called analytical functions.
Important note
Please note that the window function calculates a value for each row using the defined windows, not changing the number of rows. In contrast, using the avg
aggregate function with a GROUP BY
clause would have reduced the number of rows to one per group.
The ORDER BY clause
The ORDER BY
clause specifies in which order the query or window function lays out the rows.
The simplified syntax is as follows:
SELECT …
ORDER BY orderExpr1 [ASC|DESC] [NULLS FIRST|LAST] [, orderExpr2 …]
Each order by expression can be made out of three parts:
- One expression that identifies what to order on. It can be either a column alias, a position ordinal, or an expression, as we have seen for the
GROUP BY
clause. Please refer to that section for more details.
- An optional direction for the sorting:
ASC
for ascending or DESC
for descending sort.
- An optional specification of how to sort
null
values: NULLS FIRST
or NULLS LAST
, which are self-explanatory.
Please look at previous sections for examples of ORDER BY
in action, in both queries and window function definitions with over()
.
The LIMIT/FETCH clause
The LIMIT
clause is used to restrict the number of rows returned by the query.
LIMIT
and FETCH
are synonyms, with slightly different syntax.
Let’s look at the LIMIT
syntax:
SELECT …
[ORDER BY …]
LIMIT <count> [OFFSET <start>]
The count
parameter is a number that specifies the maximum number of rows to return.
If the OFFSET
part is present, the returned rows are the ones after the start
position. This allows tools connecting to a database to retrieve all the results in chunks of the desired size. As an example, a LIMIT 10 OFFSET 10
clause would retrieve rows from 11 to 20.
Important note
If no ORDER BY
clause is present, the order of the rows is undefined and could differ in each execution of the same query. In this case, the result of a LIMIT
clause is non-deterministic because what rows are returned depends on the order in which the rows happen in the result set.
Query clause order of evaluation
In the previous sections, we have seen all the clauses that can appear in a SELECT
statement.
Now is a good time to bring your attention to the fact that these clauses are generally evaluated in the following specific order, as well as what it is important to pay attention to for each clause:
FROM
and its JOIN
subclause, which are used to identify the source data for the query.
- The
WHERE
clause, which is used to filter out the source data that we do not want.
This is probably the most important clause for performance, because the less data a query works on, the quicker it is. Use WHERE
whenever possible to just bring in the data you need.
- The
GROUP BY
clause, which groups the source data left after applying the WHERE
clause and calculates the aggregate functions on the grouped data.
- The
HAVING
clause, which filters on the results of GROUP BY
.
- Partitioning of the windows and calculation of the window functions.
- The
QUALIFY
clause, which filters on the results of the window functions.
- The
DISTINCT
keyword, if applied to the SELECT
clause, which removes duplicated rows.
- The
ORDER BY
clause, which puts the resulting rows in the desired order.
- The
LIMIT
clause, which caps the rows returned by the query to the desired amount.
SQL operators
When writing queries, we can perform operations on the data handled by the query.
We do so by building expressions that return the desired value, using functions and operators.
We can perform an operation pretty much everywhere a value is expected: in the SELECT
clause to provide the desired outputs by transforming the inputs, in the WHERE
clause or ON
part of a JOIN
clause, HAVING
clause, or QUALIFY
clause to identify what should or should not be returned by the query, in GROUP BY
to decide how to aggregate, and so on.
Let’s go through the categories of operators and how they are used:
- Arithmetic operators: These are the traditional +, -, *, /, and % (modulo).
They expect one or more numeric (or convertible to numeric) inputs to provide a numeric result, with the usual arithmetic precedence and meaning, like in the following example:
SELECT 1 + '2' as three, (3+2) * 4 as twenty
WHERE twenty % 2 = 0;
Note that '2'
is a string but can be automatically converted to the number 2. Also note that implicit conversions happen, but explicit conversions are better.
The modulo operator returns the remainder of the division of the first operator by the second, and the val % 2 = 0
pattern is often used to identify even numbers.
- Comparison operators: These are used to test two values for equality or other comparisons, and are
=
(equal), !=
(not equal), <>
(also not equal), <
(less than), <=
(less than or equal), >
(greater than), and >=
(greater than or equal).
They are typically used in the WHERE
clause, but can be used anywhere a Boolean result is desired, such as in the following example:
SELECT 2 < 1 as nope, '3' != 'three' as yep
WHERE 1 != 2;
Note that the operators can compare all types where the operation is defined.
I have seen WHERE 1 = 2
used in many places to avoid returning any row, because 1=2
always returns FALSE
; it would be clearer to write WHERE false
directly and if you do not need to filter any row out, you can just leave the WHERE
clause off, or use the WHERE
true
expression.
- Logical operators: These are the traditional
AND
, OR
, and NOT
Boolean operators.
They operate only on Boolean predicates and values and return Booleans.
They are generally used in the WHERE
clause, but can appear anywhere an expression is allowed, such as in the following example:
SELECT *,
(C_ACCTBAL > 7500) AND (C_NATIONKEY = 24) as IS_TOP_US_CUST
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE (C_NAME IS NOT null) AND IS_TOP_US_CUST;
In the previous example, we have defined an IS_TOP_US_CUST
column using a logical operator to combine two predicates, as we have done in the WHERE
clause, also reusing the new Boolean column we created.
- Set operators: These operators are used to put together the results from multiple queries, and are
INTERSECT
, MINUS
or EXCEPT
, and UNION [ALL]
.
The simplified syntax is as follows:
query_1 <set_operator> query_2
The queries must be compatible, having the same number of columns and of the same type.
It is also important that the semantics of the columns in the same position are correct, as combining people’s names with state codes is possible, as they are both strings, but in general, it does not make much sense.
Let’s describe the set operators:
INTERSECT
returns the rows that appear in both queries, checking all columns to have the same value
MINUS
or EXCEPT
returns the rows from the first query that do not appear in the second
UNION [ALL]
returns the rows from both queries, with ALL
keeping duplicates
As an example, the following query returns data for the customers from India (8) and the US (24) and the customers in the AUTOMOBILE
segment without duplicates:
SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY IN (8, 24)
UNION
SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_MKTSEGMENT = 'AUTOMOBILE';
In this special case, as both the queries are on the same table, we could have just used a single query with a slightly more complex WHERE
clause using an OR
operator to compose the two individual clauses. In real cases, you might want to combine similar data from different tables and the set operators are here for you.
- Subquery operators: These operators allow us to use subqueries in
WHERE
clauses.
A subquery is a query defined inside another query.
A subquery can be used without any operator as a table-like object, such as selecting a subset of a table we want to use some data, or an expression if it returns a single value, as in the following example:
SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY = (
SELECT N_NATIONKEY
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION"
WHERE N_NAME = 'JAPAN'
);
In this example, the subquery returns only one value, so we can use the equal operator.
The subquery operators extend their use to other cases, as per their definitions:
ALL
/ANY
: Allows you to apply a comparison to all/any rows of the subquery
[NOT] EXISTS
: Returns true
if the subquery returns at least one row, false
with NOT
in front
[NOT] IN
: Returns true
if the expression is not included in the results of the subquery
The following example extends the previous example to the case when we want customers from more than one country, picking the country by name:
SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY IN (
SELECT N_NATIONKEY
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION"
WHERE N_NAME IN ('JAPAN', 'CANADA')
);
We could easily rewrite the same query using = ANY
instead of IN
, and probably with some changes in the subquery also using EXISTS
.
Now that we have been acquainted with the basics of querying data in SQL, let’s dive deeper into the JOIN
clause, which allows us to put together data from multiple sources, which is crucial to turning raw data into useful information.