You're reading from Getting Started with DuckDB A practical guide for accelerating your data science, data analytics, and data engineering workflows

Product type Paperback

Published in Jun 2024

Publisher Packt

ISBN-13 9781803241005

Length 382 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Authors (2):

Ned Letcher

Simon Aubury

View More author details

Table of Contents (15) Chapters

Preface

1. Chapter 1: An Introduction to DuckDB

2. Chapter 2: Loading Data into DuckDB FREE CHAPTER

3. Chapter 3: Data Manipulation with DuckDB

4. Chapter 4: DuckDB Operations and Performance

5. Chapter 5: DuckDB Extensions

6. Chapter 6: Semi-Structured Data Manipulation

7. Chapter 7: Setting up the DuckDB Python Client

8. Chapter 8: Exploring DuckDB’s Python API

9. Chapter 9: Exploring DuckDB’s R API

10. Chapter 10: Using DuckDB Effectively

11. Chapter 11: Hands-On Exploratory Data Analysis with DuckDB

12. Chapter 12: DuckDB – The Wider Pond

13. Index

Why subscribe?

14. Other Books You May Enjoy

A short SQL primer

SQL is a programming language that is specifically designed for querying, manipulating, and analyzing data. Even though SQL was originally developed in the early 1970s, it is widely used in modern data platforms and tools, with its adoption continuing to increase. SQL is a declarative language, which means that it allows us to focus on what we want to do with data, rather than having to worry about specifying the low-level steps for how to do it. It is also a rather versatile language, being frequently used across many types of applications, from ad hoc querying of data extracts to large-scale data processing pipelines and complex data analysis and reporting.

SQL’s versatility across analytical data applications, combined with its ubiquity and familiarity for many data practitioners, makes it a sensible choice for DuckDB to adopt as its primary language for managing, transforming, and querying data. It’s worth noting that SQL isn’t the only programmatic interface for interacting with DuckDB, as we’ll see later on in our DuckDB adventures. Given that many of the exercises in this book make use of SQL, in this section, we’ll go through a very short primer on some SQL fundamentals for readers who are newer to working with SQL or who haven’t used it for a while.

Creating your first DuckDB table

A database table is a collection of related data, organized in a tabular format consisting of rows and columns. Each row in a table represents a distinct record of the data being stored, while each column represents a specific attribute of the data stored in the table. Tables are an essential part of a database, providing a structured way to organize, store, and query data. Tables can also be linked to other tables through relationships, allowing for more complex data queries and analysis.

For our first example, let’s imagine we need to store information about food types in a database. Let’s create a table called foods, with columns describing attributes of each food, such as food_name, calories, and is_healthy. Each record in the foods table would represent a unique food type, with specific information about each food item stored in corresponding columns.

Let’s now create our first DuckDB table. Creating a table in SQL involves specifying the table name, as well as the column name and data type for each column we want the table to have. The following SQL statement creates a simple table called foods, with the columns we outlined previously. Run this now in an open DuckDB CLI shell:

CREATE TABLE foods (
    food_name VARCHAR PRIMARY KEY,
    color VARCHAR,
    calories INT,
    is_healthy BOOLEAN
);

Note that DuckDB allows you to write multiline commands, with a semicolon (;) character being used to indicate the end of a SQL statement. It tells the DuckDB database engine that the current statement has ended and that it should be executed before moving on to the next statement.

You have now created a table named foods with the following four columns:

food_name: The name of the food. We are using a VARCHAR data type, which is used to store variable-length character strings. The food_name column is also specified as the table’s PRIMARY KEY constraint. This constraint ensures that each row in the table has a unique value for this column.
color: The color of the food. This is also stored as a VARCHAR data type.
calories: The calorie count of the food. This is stored as an INTEGER data type, which is used to represent whole numeric values, without any decimal places.
is_healthy: An indicator of whether the food item is considered healthy. This is stored as a BOOLEAN data type, which can only take one of two values: true or false.

Once a table is created, data can be inserted into it using the INSERT statement. For example, the following SQL statement inserts a new record into the foods table:

INSERT INTO foods (food_name, color, calories, is_healthy)
VALUES ('apple', 'red', 100, true);

This inserts a new record with the values 'apple' for the food_name column, 'red' for the color column, 100 for the calories column, and true for the is_healthy column.

We can use the INSERT statement to insert multiple records at a time. The following SQL statement inserts three new records into the foods table:

INSERT INTO foods (food_name, color, calories, is_healthy)
VALUES ('banana', 'yellow', 100, true),
       ('cookie', 'brown', 200, false),
       ('chocolate', 'brown', 150, false);

Running this statement results in three new records being inserted into our table, bringing this up to four food items in the table. There are a range of additional features that the INSERT statement offers, which you can explore at the DuckDB documentation: https://duckdb.org/docs/sql/statements/insert.

Running your first DuckDB query

Now, let’s have a look at the data we added to the foods table. To do this, we’ll use the SQL SELECT command, which is used to retrieve data from one (or more) tables:

SELECT *
FROM foods;

Running this query produces the following output:

┌───────────┬─────────┬──────────┬────────────┐
│ food_name │  color  │ calories │ is_healthy │
│  varchar  │ varchar │  int32   │  boolean   │
├───────────┼─────────┼──────────┼────────────┤
│ apple     │ red     │      100 │ true       │
│ banana    │ yellow  │      100 │ true       │
│ cookie    │ brown   │      200 │ false      │
│ chocolate │ brown   │      150 │ false      │
└───────────┴─────────┴──────────┴────────────┘

Let’s unpack that query, which we can see has returned the four food items that we previously inserted into the table:

SELECT: Specifies the columns we want to retrieve from the table. We selected all columns in the target table by using the asterisk symbol (*), which functions as a wildcard. Alternatively, we could have explicitly listed one or more column names separated with commas, to return only a subset of columns.
FROM: Specifies the name of the table we want to retrieve data from; in this case, the foods table.

As we mentioned earlier in this chapter, DuckDB’s SQL dialect comes with a range of enhancements that extend traditional SQL syntax, with an eye toward a more user-friendly experience. One of these enhancements is the ability to omit the SELECT clause from a SELECT statement when returning all columns. This means that the query we just ran could be replaced with the following functionally identical and more concise query:

FROM foods;

When we created the foods table, we set a primary key on the food_name column. This instructs DuckDB to enforce the constraint that values in this column must be unique across all rows in the table. With this PRIMARY KEY constraint defined on our table, we have ensured that there is no duplication of food items in the table. We can see this constraint in action by trying to add an extra record with the name 'apple' again, but this time 'green' in color:

INSERT INTO foods (food_name, color, calories, is_healthy)
VALUES ('apple', 'green', 100, true);
Error: Constraint Error: Duplicate key "food_name: apple" violates primary key constraint. If this is an unexpected constraint violation please double check with the known index limitations section in our documentation (https://duckdb.org/docs/sql/indexes).

This error indicates our insert failed, as we expected should happen. Let’s check we have only the single original red apple by querying the table again. This time, we’ll restrict the SELECT clause to only retrieve values from the food_name and color columns, which are the values we need to check. We’ll also use a WHERE clause to apply a filter to our query so that it only returns records with a food_name value of 'apple':

SELECT food_name, color
FROM foods
WHERE food_name = 'apple';

This query returns a single result, as we expected:

┌───────────┬─────────┐
│ food_name │  color  │
│  varchar  │ varchar │
├───────────┼─────────┤
│ apple     │ red     │
└───────────┴─────────┘

Now, if we did want to change the color of the existing apple record, we could achieve this by modifying the value of its color field using the UPDATE statement. The following statement shows how we can do this:

UPDATE foods
SET color = 'green'
WHERE food_name = 'apple';

Note the use of the WHERE clause to specify the condition that must be met for the rows to be updated. Because of this filter, our update is only applied to records where food_name has the value 'apple'.

Let’s verify for ourselves that the existing apple record has had its color updated successfully by running the previous SELECT statement again:

SELECT food_name, color
FROM foods
WHERE food_name = 'apple';

This time, we see our green apple:

┌───────────┬─────────┐
│ food_name │  color  │
│  varchar  │ varchar │
├───────────┼─────────┤
│ apple     │ green   │
└───────────┴─────────┘

Lastly, another one of DuckDB’s SQL dialect enhancements is that when constructing a SELECT statement, the FROM clause can be placed before the SELECT clause. This enables queries to follow a more natural sequence, with the data source being queried placed up front, before data-source-specific components of the query, such as columns to be retrieved and column filters to be applied. Using this SQL enhancement, the query that we just ran can be rewritten as follows:

FROM foods
SELECT food_name, color
WHERE food_name = 'apple';

This brings us to the end of our condensed primer on some of the basics of SQL, which we’ll be assuming you’re across as we dive into exploring DuckDB’s impressive range of SQL-based analytical capabilities. If you’re newer to working with SQL, and you feel like you could benefit from a more comprehensive introduction to SQL to pair with this book, you could consider reading Packt Publishing’s Learn SQL Database Programming by Josephine Bush. It’s also worth noting that, in the examples we’ve just worked through, we’ve only been working with individual records, as opposed to analytical operations over columns containing many records. As we work through the book, we will explore different types of SQL-defined analytical workloads and use cases that play to DuckDB’s strengths. We’ll also see ways in which you can work with DuckDB using alternative non-SQL interfaces, which may appeal to data scientists and data analysts working in Python or R in particular. By the end of the book, we think that you’ll see how DuckDB’s adoption of SQL as its core interface enables it to be an accessible, powerful, and flexible tool for managing analytical data workloads.