SQL basics – core concepts and commands
The SQL language is certainly big and complex, taking time to be fully mastered, but surprisingly, you can get productive with the limited set of features that we will introduce in this book, and you will pick up the nuances when you’ll need them while working on real use cases.
The goal of this chapter is not to make you a SQL expert but to get you started with it, getting deeper into the few commands that we will use often to make you quickly productive with dbt.
In this intro, we start with the most basic concepts of SQL and then provide an overview of the SQL commands by their categories of use. In the rest of this chapter, we will dig deeper into the commands that are mostly used when working with dbt: SELECT
, JOIN
, and analytical and window functions.
SQL core concepts
The main goal of SQL, as its name implies, is to allow users to query data that is contained in a database; SQL also provides all the commands to fully manage the database, allowing you to add, transform, delete, organize, and manage data and other database objects, such as users and roles.
The core concepts in SQL come from how an RDB is organized:
- A database contains tables, organized in schemata (plural of schema)
- Tables store data in rows that have one value for each column defined in the table
- Columns have a name and can contain only data of the declared data type
- To regulate access to the data, privileges are assigned to users and roles
You can see a database with schemata and tables in the following screenshot, which shows part of the sample database available in any Snowflake account:
Figure 1.1: Example database with schemata and tables
Let’s go through the core concepts of SQL, starting with the table.
What is a table?
The table is the most central concept in SQL, as it is the object that contains data.
A table in SQL is very close to the layman’s concept of a table, with data organized in columns and rows.
Columns represent the attributes that can be stored in a table, such as a customer’s name or the currency of an order, with each column defining the type of data it can store, such as text, a number, or a date.
In a table, you can have as many rows as you want, and you can keep adding more whenever you need. Each row stores in the table one instance of the concept that the table represents, such as a specific order in the following example of a table for orders:
Order_ID |
Customer_CODE |
Total_amount |
Currency |
123 |
ABC |
100 |
EUR |
166 |
C099 |
125,15 |
USD |
Table 1.1: Example order table
Looking at the previous example order table, we can see that the table has four columns that allow storing the four attributes for each row (order ID, customer code, total amount, and currency of the order). The table, as represented, has two rows of data, representing one order each.
The first row, in bold, is not data but is just a header to represent the column names and make the table easier to read when printed out.
In SQL, a table is both the definition of its content (columns and their types) and the content itself (the data, organized by rows):
- Table definition: It lists the columns that make up the table, and each column provides the data type and other optional details. The data type is mandatory and declares what values will be accepted in the column.
- Table content: It is organized in rows, each row containing one value for each column defined for the table or
null
if no value is provided. - Data value: All the values in a column must be compatible with the type declared for the column.
null
is a special value that corresponds to the absence of data and is compatible with all data types.
When creating a table, we must provide a name for the table and its definition, which consists of at least column names and a type for each column; the data can be added at a different time.
In the following code block, we have a sample definition for an order table:
CREATE TABLE ORDERS ( ORDER_ID NUMBER, CUSTOMER_CODE TEXT, TOTAL_AMOUNT FLOAT, ORDER_DATE DATE, CURRENCY TEXT DEFAULT 'EUR' );
Tip
When we write some SQL, we will use Snowflake’s SQL syntax and commands. We will guide you on how to create a free Snowflake account where you can run the code shown here.
In this example, we see that the command to create a table reads pretty much like English and provides the name of the table and a list of columns with the type of data each column is going to contain.
You can see that for the CURRENCY
column, of type text
, this code also provides the default
value EUR
. Single quotes are used in SQL to delimit a piece of text
, aka a string
.
View – a special way to look at some data
If you already have data and you want to make it available with some transformation or filtering, you can create a view. You can think of a view like a table, but with the column definition and data both coming from a query, reading from one or more tables.
As an example, if you would like to have a shortlist of orders with amount greater than 1,000 you could write the following query to create a BIG_ORDERS
view:
CREATE VIEW BIG_ORDERS AS SELECT * FROM ORDERS WHERE TOTAL_AMOUNT > 1000;
In this example, we see that this simple create view
statement provides the name for the view and uses a query, which is a SELECT
statement, to define what data is made available by the view.
The query provides both the data, all the orders with a total amount greater than 1,000, and the column definitions. The *
character – called star – is a shortcut for all columns in the tables read by the SELECT
statement.
This is, of course, a naïve example, but throughout this book, you will see that combining tables that store data and views that filter and transform the data coming from tables and views is the bread and butter of working with dbt. Building one object on top of the previous allows us to take raw data as input and provide as output refined information that our users can easily access and understand.
Tip
When working with dbt, you will not need to write create table
or create view
statements, as dbt will create them for us. It is nevertheless good to get familiar with these basic SQL commands as these are the commands executed in the database and you will see them if you look in the logs.
Database and schema
We have seen that a database is a container for tables and views.
A DB can be further divided and organized using schema objects.
In real-life DBs, the number of tables can range from a few units to many thousands. Schemata act pretty much like folders, but cannot be nested, so you can always identify a table by its database, schema, and name.
In the following screenshot, we see part of the contents of the SNOWFLAKE_SAMPLE_DATA database, which is available in all Snowflake accounts:
Figure 1.2: Some schemata and tables of the SNOWFLAKE_SAMPLE_DATA database
The database and schemata, as table containers, are the main ways information is organized, but they are also used to apply security, access limitations, and other features in a hierarchical way, simplifying the management of big systems.
To create the TEST
database and the SOME_DATA
schema in it, we can use the following commands:
CREATE DATABASE TEST; CREATE SCHEMA TEST.SOME_DATA;
The database.schema
notation, also known as a fully qualified name, allows us to precisely describe in which database to create the schema and after its creation, uniquely identifies the schema.
Tip
While working with dbt, you will create a database or use an existing one for your project; dbt will create the required schema objects for you if you have not created them already.
A best practice in Snowflake is to have one database for each project or set of data that you want to keep separate for administration purposes. Databases and schemata in Snowflake are soft boundaries, as all the data from all the databases and schemata can be accessed if the user has the appropriate privileges.
In some other database systems, such as PostgreSQL, a database is a stronger boundary.
Users and roles
To control access to your data in SQL, you GRANT
access and other privileges to both users and roles.
A user represent one individual user or service that can access the database, while a role represents a named entity that can be granted privileges.
A role can be granted to a user, providing them all the privileges associated with the role.
A role can also be granted to another role, building hierarchies that use simple basic roles to build more complex roles that are assigned to a user.
Using roles instead of granting privileges directly to users allows you to manage even a large number of users simply and consistently. In this case, roles are labels for a set of privileges and become a way to manage groups of users that we want to grant the same privileges. Changing the privileges granted to a role at any moment will change the privileges that the users receive from that role.
A typical pattern when working with dbt is to create a role for the dbt users and then assign it to the developers and the service user that the dbt program will use.
The following is an example of a simple setup with one role and a couple of users:
CREATE ROLE DBT_SAMPLE_ROLE; CREATE USER MY_NAME; -- Personal user CREATE USER SAMPLE_SERVICE; -- Service user GRANT ROLE DBT_SAMPLE_ROLE TO USER MY_NAME; GRANT ROLE DBT_SAMPLE_ROLE TO USER SAMPLE_SERVICE;
A more complex setup could have one role to read and one to write for each source system (represented by a schema with the data from the system), for the data warehouse (one or more schemata where the data is processed), and for each data mart (one schema for each data mart).
You could then control in much more detail who can read and write what, at the cost of more effort.
Understanding the categories of SQL commands
SQL commands can be organized into categories according to their usage in the language:
- Data Definition Language (DDL): DDL contains the commands that are used to manage the structure and organization of a database
- Data Manipulation Language (DML): DML contains the commands that are used to manipulate data, for example,
INSERT
,DELETE
, andUPDATE
- Data Query Language (DQL): DQL contains the
SELECT
command and is the central part of SQL that allows querying and transforming the data stored in a database - Data Control Language (DCL): DCL contains the
GRANT
andREVOKE
commands, which are used to manage the privileges that control the access to database resources and objects - Transaction Control Language (TCL): TCL contains the commands to manage transactions
In the upcoming sections, we provide more details about these by looking at Snowflake-specific commands, but the ideas and names are of general use in all database systems, with little or no change.
Data Definition Language – DDL
DDL commands do not deal directly with the data but are used to create and maintain the structure and organization of the database, including creating the tables where the data is stored.
They operate on the following objects:
- Account/session objects, which contain and operate on the data, such as user, role, database, and warehouse
- Database/schema objects, which store and manipulate the data, such as schema, table, view, function, and store procedure
The main commands are as follows:
CREATE
: Used to create the database itself and other objectsDROP
: Used to delete the specified objectALTER
: Used to modify some attribute of the specified objectDESC
: Used to describe the details of the specified objectSHOW
: Used to list the existing objects of the specified object type, with metadataUSE
: Used to select the database, schema, and object to use when fully specified names are not used
Tip
When working with dbt, we use the DDL and DML commands only in macros.
We do not use the DDL and DML commands in models because dbt will generate the required commands for our models based on the metadata attached to the model.
Data Manipulation Language – DML
DML provides the commands to manipulate data in a database and carry out bulk data loading.
Snowflake also provides specific commands to stage files, such as loading files in a Snowflake-managed location, called a stage.
The main commands are as follows:
INSERT
: Inserts rows into a tableDELETE
: Removes specified rows from a tableUPDATE
: Updates some values of specified rows in a tableMERGE
: Inserts, updates, or deletes rows in a tableTRUNCATE TABLE
: Empties a table, preserving the definition and privileges
For bulk data loading, Snowflake provides the following command:
COPY INTO
: Loads data from files in a stage into a table or unloads data from a table into one or more files in a stage
To manage files in stages, Snowflake provides these file-staging commands:
PUT
: Uploads a local file to a Snowflake stageGET
: Downloads a file from a stage to the local machineLIST
: Lists the files existing in a Snowflake stageREMOVE
: Removes a file from a Snowflake stage
Important note
In dbt, we can use macros with the COPY INTO
and file-staging commands to manage the data-loading part of a data pipeline, when source data is in a file storage service such as AWS S3, Google Cloud Storage, or Microsoft Azure Data Lake file storage.
Data Query Language – DQL
DQL is the reason why SQL exists: to query and transform data.
The command that is used to query data is SELECT
, which is without any doubt the most important and versatile command in all of SQL.
For the moment, consider that a SELECT
statement, aka a query, can do all these things:
- Read data from one or more tables
- Apply functions and transformations to the data retrieved
- Filter the data retrieved from each of these tables
- Group the retrieved data on a set of columns (or transformed values), producing one row for each group, calculating functions on the grouped rows, and filtering on the results of those functions
- Calculate for each row one or more functions based on groups of rows identified by a window expression and filter the results based on the results of these functions
Important note
We have dedicated the Query syntax and operators section later in this chapter to analyzing the SELECT
command in Snowflake in detail, as you will use the SELECT
command in every dbt model.
Data Control Language – DCL
DCL contains the GRANT
and REVOKE
commands, which are used to manage privileges and roles that control access to or use database resources and objects.
Together with the DDL commands to create roles, users, and other database objects, the DCL commands are used to manage users and security:
GRANT
: Assigns a privilege or a role to a role (or user)REVOKE
: Removes a privilege or a role from a role (or user)SHOW GRANTS
: Lists access privileges granted to a role or object
Transaction Control Language – TCL
The TCL commands are used to manage transactions in a database.
A transaction groups a set of SQL commands into a single execution unit and guarantees that either all the effects of all commands are applied, if the transaction completes with success, or no effect at all is applied if the transaction fails. This can also be described with the ACID acronym, which stands for atomic, consistent, isolated, and durable.
A transaction succeeds and ends only when all the commands it groups have finished with success; in any other case, the transaction fails, and its effects are rolled back like they never happened.
The TCL commands are as follows:
BEGIN
: Starts a transaction in the current sessionCOMMIT
: Commits an open transaction in the current session and makes the transaction effects visible to all sessionsROLLBACK
: Rolls back an open transaction in the current session and undoes all the effects applied since theBEGIN
statement
Now that we have covered the basic concepts and commands in SQL, it is time to set up a database to run them. The next section will provide you with access to a Snowflake DB.