Getting Started with DuckDB

An Introduction to DuckDB

Data is everywhere, stored in a huge variety of systems across many different formats, and with an ever-growing number of tools available to data practitioners to practice their craft. DuckDB is a relatively new and explosively popular database management system (DBMS) that is increasingly being adopted for analytical data workloads by data scientists, data analysts, data engineers, and software engineers. DuckDB is open source software that is made available under the permissive MIT license, making it friendly to both commercial and non-commercial applications alike. The non-profit DuckDB Foundation stewards the long-term health of the DuckDB project, and the development of DuckDB is supported by DuckDB Labs, which employs the project’s core contributors.

In this chapter, we’ll unpack what type of database DuckDB is and identify use cases that DuckDB is well suited to and that data practitioners are increasingly adopting it for. We’ll also outline the different deployment options DuckDB comes with and take you through how to install it on your own system so that you’re ready to dive into the hands-on examples in this book. Finally, we’ll go through a quick primer on Structured Query Language (SQL), the query language DuckDB uses for its primary interface that we’ll be using for many of the exercises in this book. If you’ve wrangled your fair share of SQL before, you may want to just skim through this section. If you’re newer to using SQL, or it’s been a while between queries, then you’ll want to dive into these hands-on exercises.

By the end of this chapter, you’ll be able to orient DuckDB within the landscape of data tooling and understand what kinds of use cases you may want to consider leveraging it for, as well as be able to recognize when other data processing tooling may be more appropriate.

Across the rest of the book, we’ll show you how to take DuckDB through its paces, and in doing so, hopefully impart a sense of why there is so much enthusiasm around it. Right now, let’s jump into setting the scene for our DuckDB explorations by covering the following topics:

What is DuckDB?
Why use DuckDB?
DuckDB deployment options and installation
A short SQL primer

What is DuckDB?

Whether you’re an experienced data practitioner or just getting started working with data, you will almost certainly find yourself having to navigate the dizzying number of databases and data processing tools that you can choose from to support data-centric applications and operational systems. The reason for this overwhelming choice is that when it comes to data processing and management architectures, there is no one-size-fits-all. Each tool necessarily comes with its own set of trade-offs that make it well suited to a particular flavor of application and less so to others.

With that in mind, let’s dig into what kind of database DuckDB is and where it sits in the data-tooling landscape so that we can unpack what kinds of applications and use cases it is well suited to. One description of DuckDB, which you might encounter when poking around online resources, is the following:

DuckDB is an in-process SQL OLAP DBMS.

While this is a fairly dense description, invoking several distinct concepts from the world of databases and software applications, it does a great job of positioning where DuckDB sits in relation to other databases and data processing tools. So, let’s break this description down, going through each component and working our way from right to left:

A database management system (DBMS) is a software application for managing structured data in a database, allowing users and applications to store, manipulate, delete, and query records. While you might hear the term database being used as shorthand for DBMS, it’s worth noting that a DBMS provides additional functionality on top of the core features of a database—which is essentially to store data in a structured format that supports efficient retrieval and manipulation. A DBMS provides an interface between the database and its users, enabling them to effectively create, read, update, and delete data, while also managing the integrity, scalability, and security of the database. DuckDB is a fully-fledged DBMS that manages all these concerns for users.
Online analytical processing (OLAP) is a data processing paradigm that is characterized by complex queries over large volumes of multidimensional data, which often involve processing significant portions of a dataset. These analytical workloads often involve applying column-wise aggregation functions over entire tables and joining large tables together. The term was created in contrast to online transaction processing (OLTP), which describes transaction-oriented DBMS tools, such as PostgreSQL, MySQL, and SQLite, which are typically used as operational databases supporting software applications, where frequent reading and writing of individual records is the dominant access pattern. DuckDB is designed and optimized for fast and efficient performance over OLAP workloads.
SQL is a popular programming language used for storing, manipulating, and querying records in a wide variety of databases and data stores. It is a standard interface used for interacting with and managing relational databases, which are databases characterized by the representation of data as tables of rows and columns, with formal relationships defined across tables. SQL’s increasing ubiquity has made it something of a de facto choice for code-defined data-querying interfaces. DuckDB has its own SQL dialect, which forms the primary interface for interacting with DuckDB databases. As we will see, there are also non-SQL interfaces available for users to work with DuckDB databases. In the last section of this chapter, A short SQL primer, we’ll cover a brief introduction to the fundamentals of working with SQL for those who are new to working with it or a little rusty.
In-process means that DuckDB runs embedded within a host process. This is in contrast to most DBMSs, which typically operate standalone, running in a separate process from consuming applications, often on a remote server. By adopting an in-process model rather than a client-server architecture, DuckDB greatly simplifies installation and integration, removing the need to install and manage a standalone DBMS service, as well as the need to connect and authenticate with a remote server. A notable example of an in-process DBMS that you may have encountered is SQLite, which is a popular choice for software developers distributing apps that require reading and writing local transactional data, such as user data for mobile apps and lightweight web apps.

Putting all these pieces together, we can see that DuckDB is a fully featured relational DBMS (RDBMS) that is designed for analytical workloads, provides a SQL interface, and runs entirely embedded in a host process.

When compared with other popular databases, DuckDB is perhaps most similar to the ubiquitous SQLite in that they are both simple in-process DBMSs that write to a single-file storage format, and they are also both free and open source. The key difference between the two tools is that SQLite is optimized for row-oriented OLTP workloads and hence does not perform well on complex analytical workloads, whereas DuckDB is purpose-built for these workloads, offering extremely good performance over them. It’s for this reason that DuckDB is sometimes described as SQLite for OLAP. In fact, DuckDB appears to be the first production-ready in-process OLAP DBMS.

In the next section, we’ll explore the reasons why people are increasingly adopting DuckDB and finding it to be a valuable workhorse in their analytical data toolkit.

Why use DuckDB?

So, why might you want to use DuckDB? Let’s start by zooming all the way out. As a data practitioner, there are two broad contexts where you might find yourself getting excited about leveraging DuckDB:

Using DuckDB to scale and supercharge data science, data analytics, and ad hoc data-wrangling workflows.
Using DuckDB as a building block to build operational data engineering infrastructure and interactive data products.

The first of these is likely to be of interest to data practitioners with analytical workflows, such as data scientists, data analysts, and machine learning engineers, whereas the second is more likely to be relevant to data engineers and machine learning engineers building data infrastructure, as well as software engineers building user-facing data products. In this book, we’ll be focusing more on using DuckDB to supercharge analytical workflows; however, if you’re looking to use DuckDB for building operational data infrastructure and data products, this book will still be a great starting point to get you up to speed with DuckDB’s capabilities that make it well suited to these kinds of applications.

In this section, we’ll first go through some use cases that land in DuckDB’s sweet spot, before looking at DuckDB’s features that make it especially well-suited to these applications. We’ll finish up by discussing contexts where other tools may be more appropriate.

DuckDB use cases

DuckDB is an incredibly versatile tool for analytical data processing and management, so any attempt to describe its full range of potential applications will almost certainly be incomplete. To give you a sense of the flavor of possible applications, we’ll go through a range of use cases for DuckDB across the two broad categories mentioned previously: analytical workflows and building operational data infrastructure and products.

Supporting analytical workflows

A major component of the workflows of data scientists and data analysts is activities that involve processing often quite large datasets, from cleaning data, transforming data into the right shape, structured data modeling, running statistical algorithms, and training machine learning models. If you talk to a practitioner who has been in the trenches for a while, they will likely tell you that sometimes they feel like they’re spending more time fighting the tools they use for these tasks than they are being productive. Often the size of the data is a limiting factor, with many popular data processing tools, such as pandas dataframes in Python and dataframes in R, simply not being able to handle the size of target datasets within the memory of your workstation or, if they can, taking a frustrating amount of time to process.

Once you’ve hit the limits of your local machine, conventional wisdom is that you need to take your workload to a distributed data compute framework such as Apache Spark, Dask, or Ray, or perhaps ingest your data into a cloud data warehouse or a data lake, where a distributed SQL query engine such as Google BigQuery, Trino, or Amazon Athena can be used to run queries at scale. These solutions significantly increase the complexity of your workflows, requiring complex supporting infrastructure that must be managed and maintained, with a hefty price tag often associated with such managed services. If you’re lucky enough to have access to these tools, they still come with additional challenges, such as working with unfamiliar or constrained interfaces, and when things go wrong, you may often find yourself having to debug arcane and confusing stack traces from the underlying compute engine.

This is where DuckDB can come to the rescue, offering the simplicity of an in-process tool, with a familiar SQL interface (as well as non-SQL interfaces if you prefer) that is optimized for running complex OLAP queries over large datasets. Not only is DuckDB blazingly fast, but it is also able to handle out-of-core workloads (datasets that don’t fit into memory), enabling you to scale your workflows on a single machine much further before you need to consider more complex distributed data processing solutions.

In recent times, there have been developments in dataframe libraries that help address the performance limitations of tools such as pandas dataframes and R dataframes, such as Dask and Modin, which allow you to perform simple parallelization of dataframe operations across your CPU cores, as well as providing on-ramps to run the same queries across a distributed cluster. We also have dataframe libraries such as Polars and Vaex, which are built on top of Apache Arrow, providing more efficient memory utilization, parallelization, and the ability to handle some out-of-core workloads. These innovations in the data ecosystem are pleasing to see; however, these tools are still ultimately dataframe tools, focusing primarily on querying and data transformation—they do not give you the data management features of a DBMS.

By virtue of being a fully-fledged DBMS, DuckDB provides a range of affordances that data practitioners may not realize they’re missing from their current analytical processing workflows:

DuckDB provides transactional guarantees through ACID properties (atomicity, consistency, isolation, and durability), meaning that you don’t have to worry about corrupted data if your Python or R process crashes midway through a job.
Data integrity guarantees can be enabled through the specification of constraints that enforce properties over data inserted into tables. DuckDB allows you to specify PRIMARY KEY constraints, which enforce uniqueness across rows within a table, FOREIGN KEY constraints, which enforce referential integrity for relationships across tables, and NOT NULL constraints over column values. DuckDB also provides the ability to apply arbitrary CHECK constraints to column values in the form of Boolean expressions, such as ensuring that string identifiers only contain alphanumeric characters.
While you can use DuckDB as an entirely in-memory database, its database can also be persisted to disk and used across processes, even allowing multiple processes to read concurrently. This enables workflows and consuming patterns that dataframe libraries cannot readily support on their own.
DuckDB also includes a rich suite of data integrations, with an eye toward performance. Notable examples include optimized CSV, Parquet, and JSON loaders, which can read files in parallel, the ability to read Hive-partitioned files, and the PostgreSQL, MySQL, and SQLite extensions, which allow DuckDB to query directly from source tables in external databases, rather than having to rely on bulk imports that must be periodically refreshed.

When a data team starts to hit the limits of their existing tooling, whether due to missing data management features or insufficient performance, it’s common for the team to start building out their own bespoke tools and packages. Since this kind of custom tooling is typically not core to the value the team is providing, these resources can suffer from defects due to insufficient resources being able to be dedicated to their development. Using a well-maintained and tested DBMS that is optimized for analytical workloads removes the busywork that is associated with maintaining tooling that doesn’t represent your core value proposition.

DuckDB’s powerful feature set makes it a versatile tool for a range of analytical workflows, whether you’re performing exploratory data analysis (EDA), quickly transforming between common data formats, or building complex data science pipelines. DuckDB enables you to slurp up large datasets from across heterogeneous data sources, with a rich set of features for cleaning dirty data and normalizing inconsistent schema, through a simple interface and with blazing performance. DuckDB also has great integrations with familiar analytical tools commonly used in the data ecosystem, allowing you to mix and match DuckDB with complementary tools to assemble your own effective workflows. For example, DuckDB can query directly from and write to pandas and Polars dataframes and R dataframes, as well as Apache Arrow tables. It also offers the ability to use alternative query interfaces to SQL that may be more familiar to data scientists and data analysts, such as dplyr in R and Ibis in Python. In addition to being a powerful workhorse for complex analytical queries, all this versatility makes DuckDB a valuable Swiss Army knife that is worth having in your analytical data toolkit.

Finally, data scientists and data analysts often find themselves building custom interactive data apps or dashboards for use as proof of concepts (POCs), bespoke tools that support common workflows, or for publishing internal decision-support tools within their organization. Powerful open source dashboarding tools such as R Shiny, Streamlit, and Plotly Dash streamline the development of such data apps; however, they typically leave the integration of a data source up to the developer. DuckDB is a perfect complement to these tools, offering simple in-process integration with no external dependencies and enabling fast analytical querying performance, which is important for low-latency response times that improve the user experience of your data apps. We unpack this particular application of DuckDB further in the DuckDB-powered data apps section in Chapter 12.

Building data infrastructure

While much of the explosive growth and excitement that DuckDB has seen has been driven by folks adopting it for the types of analytical workflows we have just discussed, there is another area of application that is starting to show increased amounts of activity and demand for, which draws upon similar themes of doing more with less and simplifying and streamlining workloads. This sees DuckDB being used as a building block in modern data infrastructure for use cases that involve small-to-medium data rather than truly big data, as well as use cases that require low-latency responsiveness for consumer-facing interactive data apps. Common to these applications is a shift away from the paradigm of moving your compute to the data, which is seen as the conventional wisdom for effectively working with big data, and a move toward bringing your data to the compute. For smaller data workloads, this can be faster, more efficient, and cheaper to build and maintain.

Much of the development in modern data processing technologies has been dominated by the needs of hyperscale organizations, with scale-out tools such as MapReduce, Hadoop, and Apache Spark, as well as cloud data warehouses such as Snowflake and BigQuery, dominating the landscape. Most organizations, however, do not operate at hyperscale, and oftentimes data processing needs are quite moderate in comparison. The cleaned and enriched datasets that drive the modern data-informed business—providing business intelligence (BI) across sales, marketing, growth, and product innovation—tend not to reach the petabyte scale. There is an opportunity for data teams in many organizations to adopt leaner data architectures that are optimized for more moderate data workloads and that come with the benefits of reduced complexity and much lower total cost of ownership.

DuckDB’s performance characteristics make it well placed to be a core building block in such architectures. Some examples include the following:

Using DuckDB to perform transformations in extract, transform, and load (ETL) pipelines as an alternative to tools such as Apache Spark. Compute instances can be spun up on demand, invoking DuckDB to pull down and transform data.
For data lake contexts, where structured and semi-structured data has been landed in object storage, DuckDB can be used as a lightweight alternative to distributed SQL query engines, which data teams might otherwise reach for, such as Google BigQuery, Amazon Athena, Trino, and Dremio.
For some scenarios, DuckDB also offers the potential to replace the use of cloud data warehouses such as Snowflake or OLAP engines such as ClickHouse, where utilization of these powerful resources would be low. If your organization is only consuming a handful of data sources to produce conformed tables that drive a small number of reporting use cases, then using DuckDB to build small, targeted data cubes may well be sufficient for your needs.

Some folks have already started to roll their own solutions for adopting these architectures. See, for example, the Modern Data Stack in a Box with DuckDB post by Jacob Matson (https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html), which explores the use of open source tools to create an end-to-end lightweight data stack, with DuckDB at its core. Another post, Build a poor man’s data lake from scratch with DuckDB, by Pete Hunt and Sandy Ryza (https://dagster.io/blog/duckdb-data-lake), explores using DuckDB as a SQL query engine on top of a data lake. Meanwhile, there are also companies emerging that are oriented around offering hosted platforms that provide serverless analytics platforms driven by DuckDB, the most notable example being MotherDuck (https://motherduck.com).

Another area where traditional scale-out approaches to data processing have shown to be not always fit for purpose is around interactive data applications, such as BI dashboards and bespoke data apps. In such applications, low-latency query results in response to user interaction are crucial for supporting dynamic and ad hoc workloads with a positive user experience. However, most cloud data warehouses and distributed data processing engines are simply not able to provide the low-latency response times required for these types of workloads and must be augmented with different types of pre-aggregation and caching strategies, often in the form of separate service, which further increases complexity and architectural surface area. DuckDB’s blazing fast speeds over analytical workloads make it a compelling choice for being the backing query engine for interactive data applications. For example, the hosted BI service Mode recently switched to using DuckDB as their in-memory query engine in order to improve the speed of queries (https://mode.com/blog/how-we-switched-in-memory-data-engine-to-duck-db-to-boost-visual-data-exploration-speed). Hex and Observable are two hosted data analytics notebook services offering rich visualizations and interactivity that both recently added DuckDB integration to supercharge users’ workflows. Another notable example is Rill Data Developer, an open source tool for building dashboards, which is built on DuckDB to provide rapid response times for queries.

The use of DuckDB as a building block for data infrastructure and interactive data applications is a notable emerging trend and one we think is worth paying attention to. In the next section, we’ll further unpack the features of DuckDB that serve to make it appealing for both analytical workflows and building operational data infrastructure and data products.

DuckDB features

You may find yourself asking, what makes DuckDB so well suited to scaling analytical workflows and being used as a building block in data infrastructure? Here are some key features of DuckDB that have led to it increasingly being adopted by data practitioners.

Performance

DuckDB is optimized for OLAP workloads, making it blazingly fast and efficient for the kinds of queries frequently seen in analytical workflows. It achieves this through a range of design choices and optimizations:

As with most modern OLAP engines, DuckDB employs a column-based execution model to enable better performance over operations that are characteristic of analytical workloads. DuckDB uses a highly tuned vectorized query engine that works by processing chunks of columns at a time. Operating on column chunks rather than entire columns means that queries involving multiple operations, which require intermediate aggregations, are less likely to result in out-of-memory errors. The chunks are also tuned to be small enough so that they remain inside the CPU’s ultra-low latency L1 cache—the CPU’s fasted dedicated memory, which is drastically faster than main memory.
DuckDB leverages a range of compression algorithms, which exploit similarities in values within columns, to reduce its storage size on disk, which in turn improves read times.
DuckDB employs an end-to-end query optimizer. This means that rather than executing queries as they are written, DuckDB can automatically rewrite queries to be much more efficient.
Almost all of DuckDB’s operations come with automatic parallelism, allowing it to distribute operations over multiple CPU threads, resulting in reduced processing time.

DuckDB is also able to support out-of-core workloads, where the data to be processed does not fit within available memory. It does this by spilling over into temporary disk storage when memory is exhausted. This does increase processing times due to the slower read times of persistent storage compared to memory; however, this is typically preferable to the query failing outright. These costs can also be mitigated by the selective use of low-latency SSD drives for applications where this is a concern.

Ease of use

The design choice of operating in-process means that users of DuckDB don’t need to concern themselves with installing, maintaining, and authenticating with a standalone database server. Another key design decision of DuckDB was for it not to make use of any third-party dependencies. This makes DuckDB extremely portable across platforms and has also enabled DuckDB to be made available for a wide range of languages and runtimes. This feature of DuckDB has increased its accessibility to a diverse range of consumers, allowing it to be readily incorporated into a wide variety of workflows and tech stacks.

DuckDB also has a strong focus on improving the ergonomics of working with SQL. It has a PostgreSQL-like SQL dialect, making it familiar to many data practitioners, and also includes a wide range of alternative function aliases, matching names used in other popular databases that many practitioners will be familiar with. Notably, DuckDB’s SQL dialect has a range of enhancements designed to improve productivity when writing analytical SQL queries. Some of these include the following:

Automatic casting of data types where possible, which serves to simplify SQL queries.
Simple creation of LIST and STRUCT data types using literal values.
Accessing attributes of STRUCT data types using dot notation.
Simple string and list slicing syntax similar to Python.
The ability to define anonymous lambda functions within SQL queries that can be used for transforming and filtering lists.
List comprehension syntax similar to Python’s, enabling more convenient list transformation and filtering.
Improved ergonomics for column selection using SELECT *, with the EXCLUDE and REPLACE clauses, which allow you to leverage wildcard selection for more concise queries, while still being able to exclude or transform specific columns. This pattern-matching power can be further enhanced with the COLUMNS() expression, which allows you to select columns that match a regular expression or the output of a lambda function.
Column aliases that can be reused by subsequent columns in the same SELECT statement, as well as in WHERE and ORDER BY clauses.
The ability to start SELECT statements with its FROM clause, allowing you to improve the readability of queries by frontloading the data source. Additionally, omitting the SELECT clause entirely is interpreted as having specified SELECT *, making this common query pattern more concise.
Function-call chaining within SQL queries, similar to familiar data processing APIs such as that of pandas, which is designed around method-call chaining.
Trailing commas in SQL queries.

We’ll cover some of these features in Chapter 10. For a more comprehensive treatment of the friendly SQL enhancements that DuckDB’s SQL dialect provides, see the DuckDB documentation: https://duckdb.org/docs/guides/sql_features/friendly_sql.

Versatility

DuckDB comes enabled with a range of affordances that make it an incredibly versatile data processing and management tool. In terms of data sources, DuckDB can read and write data formats that are the mainstays of the data world: Parquet, CSV, and JSON (including newline-delimited JSON). In addition to reading from disk, these files can be read from remotely hosted files, and DuckDB can both read and write cloud object storage services using the Simple Storage Service (S3) API. DuckDB is also able to directly connect to and query from multiple databases at a time, including PostgreSQL, MySQL, and SQLite databases.

DuckDB also has tight integrations with in-memory data formats that are commonly used in the analytical data ecosystem, including pandas and Polars dataframes and R dataframes, as well as Apache Arrow tables. DuckDB provides the ability to query directly from these in-memory data structures, as well as export to them.

DuckDB’s rich set of official clients also means that it can be used in a range of contexts beyond Python- and R-based workflows. Of particular note here is the DuckDB WebAssembly (Wasm) client, which enables developers to readily integrate DuckDB into web apps that can be published to the internet for anyone to access with a browser.

Together, all these capabilities make DuckDB a veritable data Swiss Army knife for working with analytical data, opening up many interesting applications that we have only just scratched the surface of.

Powerful analytics capabilities

DuckDB offers extensive support for composing complex queries through SQL, with a particular emphasis on features commonly used in analytical workloads. DuckDB has support for optimized aggregation and join operations, column indexes, window functions, and full-text search. DuckDB supports a wide range of functions for working with different types of data, including numeric operations, date and time operations, date formatting, text operations, bit strings, and string pattern matching, including regular expressions. Analytical workloads are further enabled by SQL commands such as SAMPLE, which provides convenient probabilistic sampling of datasets, the PIVOT command for creating pivot tables, ASOF joins for interpolating missing values when joining time series data, and the QUALIFY clause, for convenient filtering of window-functions results.

DuckDB also has a rich array of data types, which, in addition to those you’d expect, such as numeric, date, and text types, also includes handy types such as INTERVAL for date ranges, ENUM for enumerations, and powerful nested data types for holding multiple values, which include ARRAY, LIST, STRUCT, MAP, and UNION. DuckDB also offers support for analytical applications that involve working with geospatial data, using its spatial extension.

As this book was nearing completion, DuckDB released a vector similarity search extension, which enables using DuckDB’s ARRAY data type for machine learning and data science applications involving vector embeddings and semantic similarity search.

Engaged community

DuckDB is open sourced under the permissive MIT license, making it readily adoptable and extensible for a wide range of commercial and non-commercial applications. The community that has formed around DuckDB has generated many valuable open source extensions, integrations, and alternative DuckDB clients for different languages and runtimes. This rich ecosystem of DuckDB-oriented projects is indicative of and has been a powerful catalyst for the enthusiasm behind DuckDB.

When is DuckDB not the right tool?

As we have already discussed, when it comes to databases, there is no one-size-fits-all solution. This means that DuckDB won’t always be the right tool for the job.

Since it is optimized for analytical workloads running on a single machine, DuckDB has been intentionally designed to perform well under a specific set of access patterns, which you should confirm are acceptable for your use case. If your workloads correspond more to the OLTP paradigm, with many individual record transactions, including frequent writes, then DuckDB, which is optimized for OLAP workloads consisting mostly of read access and column-oriented queries, is likely not the best tool for you. If you’re after an in-process DBMS that’s optimized for OLTP workloads, then SQLite is hard to beat. Another specific consideration is that in order to open a DuckDB database that can be written to, only one process can both read from and write to the database. If you need multiple processes to be able to concurrently read from a DuckDB database, you must load it in read-only mode.

DuckDB’s performance is truly impressive, allowing you to push the limits of what a single machine can do, arguably redefining what actually constitutes big data for analytical workloads. However, it is ultimately a database that operates in-process on a single machine, and so there are limits to how far it can be pushed. If your target workloads involve regularly processing petabyte-scale datasets, then you will likely need to use some form of distributed data processing-enabled platform.

Now that we’ve established what DuckDB is and when it makes sense to use it, we’ll start to get more practical by looking at DuckDB’s deployment options and how you can get started working with it.

DuckDB deployment options and installation

In this section, we’ll look at ways you can use and integrate DuckDB into your analytical workflows, operational infrastructure, and data products. We’ll start by outlining the different clients available for DuckDB, and then we’ll go through how to get started working with the DuckDB command-line interface (CLI) on your own machine.

DuckDB deployment options

The data technology landscape is a big place, full of a diverse range of data practitioners with different skill sets and a wide range of tools built on a heterogeneous set of technologies. To cater to this diversity, DuckDB is made available via a number of different official client APIs, covering users of Python, R, JavaScript, Rust, Swift, Julia, Java, C, and C++. In the case of JavaScript, there are two clients: one for Node.js, oriented around backend applications, and one for Wasm, which allows DuckDB to run entirely within any modern web browser. Additionally, DuckDB is also made available as a cross-platform CLI, which is distributed as an executable that can be run virtually anywhere you have access to a command line. At the time of writing, there are also community-supported clients for Rust, Go, C#, Ruby, Common Lisp, Zig, and Crystal. As this list is ever-growing, we encourage you to consult the DuckDB documentation for an up-to-date list of official DuckDB clients and third-party contributed clients: https://duckdb.org/docs/api/overview.

This large selection of clients and integrations provides good coverage over languages and runtimes that meet the needs of a wide range of data practitioners. Here are the clients we’ll cover in this book:

The DuckDB CLI is a cross-platform executable that gives you a CLI for interacting with DuckDB databases via SQL. It’s compiled without any external dependencies, meaning that you can run it virtually anywhere you have access to a terminal. Its ease of installation and portability make it a great way to get started with DuckDB, which is why we will be making use of it for many of the hands-on examples in this book. We’ll walk through how to install it and get started using it shortly.
The DuckDB Python client enables Python processes to readily communicate with DuckDB databases. It offers a number of distinct APIs for querying and interacting with DuckDB, making it suitable for a range of applications, spanning the spectrum of interactive data analysis to building data infrastructure and analytical data products. In Chapter 7, we introduce the DuckDB Python client, focusing in particular on how to connect to DuckDB databases. In Chapter 8, we go on a deeper dive into the DuckDB Python client, focusing on two of the APIs that it exposes: the DuckDB-specific Relational API and the Python Database API (DB-API). In Chapter 11, we’ll use the Python client again, this time to perform EDA of a dataset in a Jupyter Notebook. Finally, in Chapter 12’s Alternative DuckDB query interfaces section, we touch on DuckDB’s Spark API—another component of the DuckDB Python client—which enables interacting with DuckDB databases using PySpark queries.
The DuckDB R client enables R sessions to connect to and work with DuckDB databases. The client provides support for connecting to DuckDB via the R database interface, as well as dbplyr integration, meaning that R users can query DuckDB databases using the powerful and popular dplyr interface, as an alternative to SQL. Along with DuckDB’s core features, this makes DuckDB a powerful component of an R analytical toolchain. We cover all this in Chapter 9, where we go on a tour of the DuckDB R client.
The DuckDB Wasm client is a full DuckDB client that has been compiled to run on Wasm, which is a virtual machine (VM) that runs on all modern browsers. With DuckDB Wasm, you can build web apps in JavaScript that can interact with DuckDB databases, running completely on client devices. This opens up a range of compelling possibilities for building lean analytical data apps with low-latency response times. In Chapter 12’s DuckDB-powered data apps section, we’ll unpack these possibilities further, identifying contexts where you might want to consider adopting DuckDB Wasm for building data apps. We’ll also cover using the DuckDB Web shell, a DuckDB CLI that runs completely within the browser, which you can try out online: https://shell.duckdb.org.
DuckDB provides clients for both the Open Database Connectivity (ODBC) API and the Java Database Connectivity (JDBC) API. These are both important workhorses in the data ecosystem, being frequently used to connect analytical data applications, such as BI tools, to databases for querying. We discuss both these DuckDB integration targets in the DuckDB integration section in Chapter 12.
The DuckDB Arrow Database Connectivity (ADBC) client provides an implementation of the ADBC API. This plays an analogous role to the JDBC and ODBC clients, enabling applications to connect to DuckDB databases as a data source, with the notable difference that the ADBC client makes use of Apache Arrow, an in-memory columnar data format. This is a much better fit for analytical applications, addressing the inefficiencies associated with the use of the OLTP-oriented JDBC and ODBC APIs. We’ll discuss the DuckDB ADBC client in Chapter 12’s DuckDB integration section.

Next, we’ll get the DuckDB CLI installed so that we’re ready to dive into some hands-on DuckDB examples.

Installing the DuckDB CLI

The DuckDB CLI is made available for Windows, macOS, and Linux operating systems. For up-to-date installation options and instructions, go to the DuckDB installation page and ensure that you have the appropriate operating system for your machine selected: https://duckdb.org/docs/installation. You can choose between using a package manager to install DuckDB onto your system, or directly downloading a ZIP file that contains the DuckDB CLI executable. If you haven’t used the package manager indicated in the instructions for your operating system, we suggest you take the direct download approach. Please proceed now to either install DuckDB using the package manager or download the DuckDB CLI ZIP file. If downloading directly, once the file has downloaded, unzip it and extract the DuckDB executable file into the directory you plan to work in.

Which version of DuckDB?

In this book, the output you’ll see comes from the most recent version of DuckDB available at the time of writing (April 2024), which is 0.10.2. DuckDB Labs has indicated that the intention is for no new features to be added between this release and the 1.0.0 release, with the focus being on stability and robustness. We therefore recommend that you use the most recent version of DuckDB to work through these exercises. You may see some minor differences in the output of commands and error messages with later versions; however, this is unlikely to interfere with the exercises. If you do come across any unexpected behavior, as always, the DuckDB documentation should be considered the source of truth: https://duckdb.org/docs. We will endeavor to keep the code in the GitHub repository for this book up to date in the event that there are breaking changes: https://github.com/PacktPublishing/Getting-Started-with-DuckDB.

Starting the DuckDB CLI

Since we’re working with a CLI, you’ll need to open a terminal application. You may have a preferred terminal application you already use, or you can simply use the one that comes with your operating system. Note that for Windows, you can use either PowerShell or the cmd.exe application. We’ll refer to the terminal application you’ve chosen as your terminal from here on.

Start by loading up your terminal and changing directory to the one you plan to work in. If you downloaded the DuckDB executable, this should be in the same directory as the one you placed the unzipped DuckDB executable in. Then, run one of the following commands appropriate for your context to start the DuckDB CLI.

Run the following command if you installed DuckDB using a package manager or if you are using Windows’ cmd.exe application for your terminal application and downloaded the executable:

$ duckdb

If you are using PowerShell on Windows and downloaded the executable, run the following command:

$ .\duckdb

For macOS and Linux terminal applications and where you downloaded the executable directly, run this command:

$ ./duckdb

Note which alternative you used, as you may need to adapt subsequent duckdb executable invocations accordingly. Your terminal should now look something like this:

v0.10.2 1601d94f94
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D

This indicates that you’re now inside the DuckDB shell, which gives you an interactive read-eval-print loop (REPL), ready for you to start entering commands and interacting with DuckDB.

Working with the DuckDB CLI

Within the DuckDB shell, the D symbol indicates a waiting prompt, meaning that it’s ready to accept input. In the next section, we’ll go through some SQL basics and run through several SQL statements for inserting and querying data. For now, let’s run a very simple query that retrieves a made-up record of values that we have specified within the query:

SELECT 'hello' AS greeting, 'world' AS name;

This query produces the following output:

┌──────────┬─────────┐
│ greeting │  name   │
│ varchar  │ varchar │
├──────────┼─────────┤
│ hello    │ world   │
└──────────┴─────────┘

After running the query, you will immediately see the resulting record, rendered as a table with a single row and corresponding column headers that indicate both the name and data type of each column.

You may have noticed that when we opened the shell, DuckDB informed us that it was connected to a transient in-memory database. This means that when you close the shell, the database being used by DuckDB, along with any data it contains, won’t be persisted anywhere. For many applications, this is perfectly fine. Many ad hoc data analysis or transformation activities can be performed within a single session, with the final results being written to disk in an appropriate file format. As you continue exploring DuckDB use cases, you will discover contexts where it will be useful to persist cleaned and prepared tables to disk so that they can be reused across subsequent sessions. For these contexts, DuckDB supports opening a database as a persistent disk-based file. We can achieve this in two ways using the DuckDB CLI.

One is via the command line, by passing the path to an existing DuckDB database file as an argument to the duckdb executable when you load the DuckDB CLI. To try this out, first, make sure you exit any running DuckDB CLI shell by executing the .quit command. Then enter the following command in your terminal:

$ ./duckdb my_db.duckdb

The other way to open a persistent file-based database is via an already running DuckDB CLI shell, using the .open CLI command followed by the path to your desired database file:

.open my_db.duckdb

In both cases, this will result in an open connection to a disk-based DuckDB database stored in the my_db.duckdb file. If this file already exists on disk, DuckDB will load and start using it. If the file does not already exist, DuckDB will first create it before connecting to it. When you close the shell, any outstanding changes to the database will be written safely to the file.

The .open command is just one of a number of special dot commands available in DuckDB’s shell. For example, the .mode command changes the formatting used to display tables that are returned after running a query. This is useful for quickly copying and pasting query results in other formats, such as CSV and JSON. For example, let’s render the output of the query we ran previously in an HTML table. To do this, first change the output mode to html and then rerun the query:

.mode html
SELECT 'hello' AS greeting, 'world' AS name;

We now see the same result but formatted as an HTML table:

<TR><TH>greeting</TH>
<TH>name</TH>
</TR>
<TR><TD>hello</TD>
<TD>world</TD>
</TR>

We can also change DuckDB’s output mode at the command line. By combining this feature with DuckDB’s ability to pass SQL commands to the duckdb executable, we can start to see a glimpse of some of the versatility DuckDB has to offer:

$ ./duckdb -markdown -s "SELECT 'hello' AS greeting, 'world' AS name;"

Running this command on your terminal results in the same output, this time formatted in markdown:

| greeting | name  |
|----------|-------|
| hello    | world |

To see the available command-line parameters the duckdb executable supports, we can pass it the --help flag:

$ ./duckdb --help

For a complete list of output modes DuckDB supports, as well as a reference for other available dot commands, see the DuckDB CLI documentation: https://duckdb.org/docs/api/cli.

Now that we’ve got the DuckDB CLI set up and have seen how to work with it, we’re ready to start our adventures getting started with DuckDB. In the next section, we provide a very brief introduction to working with SQL, via the DuckDB CLI. If you’re already well versed in writing SQL, you’re welcome to skim through or skip this section. For those of you newer to working with SQL, or if you haven’t used it for a while, this will make sure that you’re across some fundamentals we’ll be assuming as we dive into exploring DuckDB’s features.

A short SQL primer

SQL is a programming language that is specifically designed for querying, manipulating, and analyzing data. Even though SQL was originally developed in the early 1970s, it is widely used in modern data platforms and tools, with its adoption continuing to increase. SQL is a declarative language, which means that it allows us to focus on what we want to do with data, rather than having to worry about specifying the low-level steps for how to do it. It is also a rather versatile language, being frequently used across many types of applications, from ad hoc querying of data extracts to large-scale data processing pipelines and complex data analysis and reporting.

SQL’s versatility across analytical data applications, combined with its ubiquity and familiarity for many data practitioners, makes it a sensible choice for DuckDB to adopt as its primary language for managing, transforming, and querying data. It’s worth noting that SQL isn’t the only programmatic interface for interacting with DuckDB, as we’ll see later on in our DuckDB adventures. Given that many of the exercises in this book make use of SQL, in this section, we’ll go through a very short primer on some SQL fundamentals for readers who are newer to working with SQL or who haven’t used it for a while.

Creating your first DuckDB table

A database table is a collection of related data, organized in a tabular format consisting of rows and columns. Each row in a table represents a distinct record of the data being stored, while each column represents a specific attribute of the data stored in the table. Tables are an essential part of a database, providing a structured way to organize, store, and query data. Tables can also be linked to other tables through relationships, allowing for more complex data queries and analysis.

For our first example, let’s imagine we need to store information about food types in a database. Let’s create a table called foods, with columns describing attributes of each food, such as food_name, calories, and is_healthy. Each record in the foods table would represent a unique food type, with specific information about each food item stored in corresponding columns.

Let’s now create our first DuckDB table. Creating a table in SQL involves specifying the table name, as well as the column name and data type for each column we want the table to have. The following SQL statement creates a simple table called foods, with the columns we outlined previously. Run this now in an open DuckDB CLI shell:

CREATE TABLE foods (
    food_name VARCHAR PRIMARY KEY,
    color VARCHAR,
    calories INT,
    is_healthy BOOLEAN
);

Note that DuckDB allows you to write multiline commands, with a semicolon (;) character being used to indicate the end of a SQL statement. It tells the DuckDB database engine that the current statement has ended and that it should be executed before moving on to the next statement.

You have now created a table named foods with the following four columns:

food_name: The name of the food. We are using a VARCHAR data type, which is used to store variable-length character strings. The food_name column is also specified as the table’s PRIMARY KEY constraint. This constraint ensures that each row in the table has a unique value for this column.
color: The color of the food. This is also stored as a VARCHAR data type.
calories: The calorie count of the food. This is stored as an INTEGER data type, which is used to represent whole numeric values, without any decimal places.
is_healthy: An indicator of whether the food item is considered healthy. This is stored as a BOOLEAN data type, which can only take one of two values: true or false.

Once a table is created, data can be inserted into it using the INSERT statement. For example, the following SQL statement inserts a new record into the foods table:

INSERT INTO foods (food_name, color, calories, is_healthy)
VALUES ('apple', 'red', 100, true);

This inserts a new record with the values 'apple' for the food_name column, 'red' for the color column, 100 for the calories column, and true for the is_healthy column.

We can use the INSERT statement to insert multiple records at a time. The following SQL statement inserts three new records into the foods table:

INSERT INTO foods (food_name, color, calories, is_healthy)
VALUES ('banana', 'yellow', 100, true),
       ('cookie', 'brown', 200, false),
       ('chocolate', 'brown', 150, false);

Running this statement results in three new records being inserted into our table, bringing this up to four food items in the table. There are a range of additional features that the INSERT statement offers, which you can explore at the DuckDB documentation: https://duckdb.org/docs/sql/statements/insert.

Running your first DuckDB query

Now, let’s have a look at the data we added to the foods table. To do this, we’ll use the SQL SELECT command, which is used to retrieve data from one (or more) tables:

SELECT *
FROM foods;

Running this query produces the following output:

┌───────────┬─────────┬──────────┬────────────┐
│ food_name │  color  │ calories │ is_healthy │
│  varchar  │ varchar │  int32   │  boolean   │
├───────────┼─────────┼──────────┼────────────┤
│ apple     │ red     │      100 │ true       │
│ banana    │ yellow  │      100 │ true       │
│ cookie    │ brown   │      200 │ false      │
│ chocolate │ brown   │      150 │ false      │
└───────────┴─────────┴──────────┴────────────┘

Let’s unpack that query, which we can see has returned the four food items that we previously inserted into the table:

SELECT: Specifies the columns we want to retrieve from the table. We selected all columns in the target table by using the asterisk symbol (*), which functions as a wildcard. Alternatively, we could have explicitly listed one or more column names separated with commas, to return only a subset of columns.
FROM: Specifies the name of the table we want to retrieve data from; in this case, the foods table.

As we mentioned earlier in this chapter, DuckDB’s SQL dialect comes with a range of enhancements that extend traditional SQL syntax, with an eye toward a more user-friendly experience. One of these enhancements is the ability to omit the SELECT clause from a SELECT statement when returning all columns. This means that the query we just ran could be replaced with the following functionally identical and more concise query:

FROM foods;

When we created the foods table, we set a primary key on the food_name column. This instructs DuckDB to enforce the constraint that values in this column must be unique across all rows in the table. With this PRIMARY KEY constraint defined on our table, we have ensured that there is no duplication of food items in the table. We can see this constraint in action by trying to add an extra record with the name 'apple' again, but this time 'green' in color:

INSERT INTO foods (food_name, color, calories, is_healthy)
VALUES ('apple', 'green', 100, true);
Error: Constraint Error: Duplicate key "food_name: apple" violates primary key constraint. If this is an unexpected constraint violation please double check with the known index limitations section in our documentation (https://duckdb.org/docs/sql/indexes).

This error indicates our insert failed, as we expected should happen. Let’s check we have only the single original red apple by querying the table again. This time, we’ll restrict the SELECT clause to only retrieve values from the food_name and color columns, which are the values we need to check. We’ll also use a WHERE clause to apply a filter to our query so that it only returns records with a food_name value of 'apple':

SELECT food_name, color
FROM foods
WHERE food_name = 'apple';

This query returns a single result, as we expected:

┌───────────┬─────────┐
│ food_name │  color  │
│  varchar  │ varchar │
├───────────┼─────────┤
│ apple     │ red     │
└───────────┴─────────┘

Now, if we did want to change the color of the existing apple record, we could achieve this by modifying the value of its color field using the UPDATE statement. The following statement shows how we can do this:

UPDATE foods
SET color = 'green'
WHERE food_name = 'apple';

Note the use of the WHERE clause to specify the condition that must be met for the rows to be updated. Because of this filter, our update is only applied to records where food_name has the value 'apple'.

Let’s verify for ourselves that the existing apple record has had its color updated successfully by running the previous SELECT statement again:

SELECT food_name, color
FROM foods
WHERE food_name = 'apple';

This time, we see our green apple:

┌───────────┬─────────┐
│ food_name │  color  │
│  varchar  │ varchar │
├───────────┼─────────┤
│ apple     │ green   │
└───────────┴─────────┘

Lastly, another one of DuckDB’s SQL dialect enhancements is that when constructing a SELECT statement, the FROM clause can be placed before the SELECT clause. This enables queries to follow a more natural sequence, with the data source being queried placed up front, before data-source-specific components of the query, such as columns to be retrieved and column filters to be applied. Using this SQL enhancement, the query that we just ran can be rewritten as follows:

FROM foods
SELECT food_name, color
WHERE food_name = 'apple';

This brings us to the end of our condensed primer on some of the basics of SQL, which we’ll be assuming you’re across as we dive into exploring DuckDB’s impressive range of SQL-based analytical capabilities. If you’re newer to working with SQL, and you feel like you could benefit from a more comprehensive introduction to SQL to pair with this book, you could consider reading Packt Publishing’s Learn SQL Database Programming by Josephine Bush. It’s also worth noting that, in the examples we’ve just worked through, we’ve only been working with individual records, as opposed to analytical operations over columns containing many records. As we work through the book, we will explore different types of SQL-defined analytical workloads and use cases that play to DuckDB’s strengths. We’ll also see ways in which you can work with DuckDB using alternative non-SQL interfaces, which may appeal to data scientists and data analysts working in Python or R in particular. By the end of the book, we think that you’ll see how DuckDB’s adoption of SQL as its core interface enables it to be an accessible, powerful, and flexible tool for managing analytical data workloads.

Vishnuvardhan Oct 17, 2024

"Getting Started with DuckDB" provides an excellent, hands-on introduction to DuckDB, showcasing its speed and versatility in data analytics and engineering. The practical examples and easy-to-follow explanations make it a valuable resource for anyone looking to enhance their workflows. Ideal for beginners and experienced professionals alike, it bridges the gap between theory and application effectively

Amazon Verified review