A short SQL primer
SQL is a programming language that is specifically designed for querying, manipulating, and analyzing data. Even though SQL was originally developed in the early 1970s, it is widely used in modern data platforms and tools, with its adoption continuing to increase. SQL is a declarative language, which means that it allows us to focus on what we want to do with data, rather than having to worry about specifying the low-level steps for how to do it. It is also a rather versatile language, being frequently used across many types of applications, from ad hoc querying of data extracts to large-scale data processing pipelines and complex data analysis and reporting.
SQL’s versatility across analytical data applications, combined with its ubiquity and familiarity for many data practitioners, makes it a sensible choice for DuckDB to adopt as its primary language for managing, transforming, and querying data. It’s worth noting that SQL isn’t the only programmatic interface for interacting with DuckDB, as we’ll see later on in our DuckDB adventures. Given that many of the exercises in this book make use of SQL, in this section, we’ll go through a very short primer on some SQL fundamentals for readers who are newer to working with SQL or who haven’t used it for a while.
Creating your first DuckDB table
A database table is a collection of related data, organized in a tabular format consisting of rows and columns. Each row in a table represents a distinct record of the data being stored, while each column represents a specific attribute of the data stored in the table. Tables are an essential part of a database, providing a structured way to organize, store, and query data. Tables can also be linked to other tables through relationships, allowing for more complex data queries and analysis.
For our first example, let’s imagine we need to store information about food types in a database. Let’s create a table called foods
, with columns describing attributes of each food, such as food_name
, calories
, and is_healthy
. Each record in the foods
table would represent a unique food type, with specific information about each food item stored in corresponding columns.
Let’s now create our first DuckDB table. Creating a table in SQL involves specifying the table name, as well as the column name and data type for each column we want the table to have. The following SQL statement creates a simple table called foods
, with the columns we outlined previously. Run this now in an open DuckDB CLI shell:
CREATE TABLE foods ( food_name VARCHAR PRIMARY KEY, color VARCHAR, calories INT, is_healthy BOOLEAN );
Note that DuckDB allows you to write multiline commands, with a semicolon (;
) character being used to indicate the end of a SQL statement. It tells the DuckDB database engine that the current statement has ended and that it should be executed before moving on to the next statement.
You have now created a table named foods
with the following four columns:
food_name
: The name of the food. We are using aVARCHAR
data type, which is used to store variable-length character strings. Thefood_name
column is also specified as the table’sPRIMARY KEY
constraint. This constraint ensures that each row in the table has a unique value for this column.color
: The color of the food. This is also stored as aVARCHAR
data type.calories
: The calorie count of the food. This is stored as anINTEGER
data type, which is used to represent whole numeric values, without any decimal places.is_healthy
: An indicator of whether the food item is considered healthy. This is stored as aBOOLEAN
data type, which can only take one of two values:true
orfalse
.
Once a table is created, data can be inserted into it using the INSERT
statement. For example, the following SQL statement inserts a new record into the foods
table:
INSERT INTO foods (food_name, color, calories, is_healthy) VALUES ('apple', 'red', 100, true);
This inserts a new record with the values 'apple'
for the food_name
column, 'red'
for the color column, 100
for the calories
column, and true
for the is_healthy
column.
We can use the INSERT
statement to insert multiple records at a time. The following SQL statement inserts three new records into the foods
table:
INSERT INTO foods (food_name, color, calories, is_healthy) VALUES ('banana', 'yellow', 100, true), ('cookie', 'brown', 200, false), ('chocolate', 'brown', 150, false);
Running this statement results in three new records being inserted into our table, bringing this up to four food items in the table. There are a range of additional features that the INSERT
statement offers, which you can explore at the DuckDB documentation: https://duckdb.org/docs/sql/statements/insert.
Running your first DuckDB query
Now, let’s have a look at the data we added to the foods
table. To do this, we’ll use the SQL SELECT
command, which is used to retrieve data from one (or more) tables:
SELECT * FROM foods;
Running this query produces the following output:
┌───────────┬─────────┬──────────┬────────────┐ │ food_name │ color │ calories │ is_healthy │ │ varchar │ varchar │ int32 │ boolean │ ├───────────┼─────────┼──────────┼────────────┤ │ apple │ red │ 100 │ true │ │ banana │ yellow │ 100 │ true │ │ cookie │ brown │ 200 │ false │ │ chocolate │ brown │ 150 │ false │ └───────────┴─────────┴──────────┴────────────┘
Let’s unpack that query, which we can see has returned the four food items that we previously inserted into the table:
SELECT
: Specifies the columns we want to retrieve from the table. We selected all columns in the target table by using the asterisk symbol (*
), which functions as a wildcard. Alternatively, we could have explicitly listed one or more column names separated with commas, to return only a subset of columns.FROM
: Specifies the name of the table we want to retrieve data from; in this case, thefoods
table.
As we mentioned earlier in this chapter, DuckDB’s SQL dialect comes with a range of enhancements that extend traditional SQL syntax, with an eye toward a more user-friendly experience. One of these enhancements is the ability to omit the SELECT
clause from a SELECT
statement when returning all columns. This means that the query we just ran could be replaced with the following functionally identical and more concise query:
FROM foods;
When we created the foods
table, we set a primary key on the food_name
column. This instructs DuckDB to enforce the constraint that values in this column must be unique across all rows in the table. With this PRIMARY KEY
constraint defined on our table, we have ensured that there is no duplication of food items in the table. We can see this constraint in action by trying to add an extra record with the name 'apple'
again, but this time 'green'
in color:
INSERT INTO foods (food_name, color, calories, is_healthy) VALUES ('apple', 'green', 100, true); Error: Constraint Error: Duplicate key "food_name: apple" violates primary key constraint. If this is an unexpected constraint violation please double check with the known index limitations section in our documentation (https://duckdb.org/docs/sql/indexes).
This error indicates our insert failed, as we expected should happen. Let’s check we have only the single original red apple by querying the table again. This time, we’ll restrict the SELECT
clause to only retrieve values from the food_name
and color
columns, which are the values we need to check. We’ll also use a WHERE
clause to apply a filter to our query so that it only returns records with a food_name
value of 'apple'
:
SELECT food_name, color FROM foods WHERE food_name = 'apple';
This query returns a single result, as we expected:
┌───────────┬─────────┐ │ food_name │ color │ │ varchar │ varchar │ ├───────────┼─────────┤ │ apple │ red │ └───────────┴─────────┘
Now, if we did want to change the color of the existing apple
record, we could achieve this by modifying the value of its color
field using the UPDATE
statement. The following statement shows how we can do this:
UPDATE foods SET color = 'green' WHERE food_name = 'apple';
Note the use of the WHERE
clause to specify the condition that must be met for the rows to be updated. Because of this filter, our update is only applied to records where food_name
has the value 'apple'
.
Let’s verify for ourselves that the existing apple
record has had its color updated successfully by running the previous SELECT
statement again:
SELECT food_name, color FROM foods WHERE food_name = 'apple';
This time, we see our green apple:
┌───────────┬─────────┐ │ food_name │ color │ │ varchar │ varchar │ ├───────────┼─────────┤ │ apple │ green │ └───────────┴─────────┘
Lastly, another one of DuckDB’s SQL dialect enhancements is that when constructing a SELECT
statement, the FROM
clause can be placed before the SELECT
clause. This enables queries to follow a more natural sequence, with the data source being queried placed up front, before data-source-specific components of the query, such as columns to be retrieved and column filters to be applied. Using this SQL enhancement, the query that we just ran can be rewritten as follows:
FROM foods SELECT food_name, color WHERE food_name = 'apple';
This brings us to the end of our condensed primer on some of the basics of SQL, which we’ll be assuming you’re across as we dive into exploring DuckDB’s impressive range of SQL-based analytical capabilities. If you’re newer to working with SQL, and you feel like you could benefit from a more comprehensive introduction to SQL to pair with this book, you could consider reading Packt Publishing’s Learn SQL Database Programming by Josephine Bush. It’s also worth noting that, in the examples we’ve just worked through, we’ve only been working with individual records, as opposed to analytical operations over columns containing many records. As we work through the book, we will explore different types of SQL-defined analytical workloads and use cases that play to DuckDB’s strengths. We’ll also see ways in which you can work with DuckDB using alternative non-SQL interfaces, which may appeal to data scientists and data analysts working in Python or R in particular. By the end of the book, we think that you’ll see how DuckDB’s adoption of SQL as its core interface enables it to be an accessible, powerful, and flexible tool for managing analytical data workloads.