Getting Started with DuckDB

You're reading from Getting Started with DuckDB: A practical guide for accelerating your data science, data analytics, and data engineering workflows

Product type Paperback
Published in Jun 2024
Publisher Packt
ISBN-13 9781803241005
Length 382 pages
Edition 1st Edition
Authors (2):
Ned Letcher
Simon Aubury

Table of Contents (15 chapters)

Preface
Chapter 1: An Introduction to DuckDB
Chapter 2: Loading Data into DuckDB
Chapter 3: Data Manipulation with DuckDB
Chapter 4: DuckDB Operations and Performance
Chapter 5: DuckDB Extensions
Chapter 6: Semi-Structured Data Manipulation
Chapter 7: Setting up the DuckDB Python Client
Chapter 8: Exploring DuckDB’s Python API
Chapter 9: Exploring DuckDB’s R API
Chapter 10: Using DuckDB Effectively
Chapter 11: Hands-On Exploratory Data Analysis with DuckDB
Chapter 12: DuckDB – The Wider Pond
Index
Other Books You May Enjoy

What is DuckDB?

Whether you’re an experienced data practitioner or just getting started working with data, you will almost certainly find yourself having to navigate the dizzying number of databases and data processing tools that you can choose from to support data-centric applications and operational systems. The reason for this overwhelming choice is that when it comes to data processing and management architectures, there is no one-size-fits-all. Each tool necessarily comes with its own set of trade-offs that make it well suited to a particular flavor of application and less so to others.

With that in mind, let’s dig into what kind of database DuckDB is and where it sits in the data-tooling landscape so that we can unpack what kinds of applications and use cases it is well suited to. One description of DuckDB, which you might encounter when poking around online resources, is the following:

DuckDB is an in-process SQL OLAP DBMS.

While this is a fairly dense description, invoking several distinct concepts from the world of databases and software applications, it does a great job of positioning where DuckDB sits in relation to other databases and data processing tools. So, let’s break this description down, going through each component and working our way from right to left:

  • A database management system (DBMS) is a software application for managing structured data in a database, allowing users and applications to store, manipulate, delete, and query records. While you might hear the term database used as shorthand for DBMS, it’s worth noting that a DBMS provides additional functionality on top of the core function of a database, which is essentially to store data in a structured format that supports efficient retrieval and manipulation. A DBMS provides an interface between the database and its users, enabling them to effectively create, read, update, and delete data, while also managing the integrity, scalability, and security of the database. DuckDB is a fully-fledged DBMS that manages all these concerns for users.
  • Online analytical processing (OLAP) is a data processing paradigm that is characterized by complex queries over large volumes of multidimensional data, which often involve processing significant portions of a dataset. These analytical workloads often involve applying column-wise aggregation functions over entire tables and joining large tables together. The term was created in contrast to online transaction processing (OLTP), which describes transaction-oriented DBMS tools, such as PostgreSQL, MySQL, and SQLite, which are typically used as operational databases supporting software applications, where frequent reading and writing of individual records is the dominant access pattern. DuckDB is designed and optimized for fast and efficient performance over OLAP workloads.
  • SQL is a popular programming language used for storing, manipulating, and querying records in a wide variety of databases and data stores. It is a standard interface used for interacting with and managing relational databases, which are databases characterized by the representation of data as tables of rows and columns, with formal relationships defined across tables. SQL’s increasing ubiquity has made it something of a de facto choice for code-defined data-querying interfaces. DuckDB has its own SQL dialect, which forms the primary interface for interacting with DuckDB databases. As we will see, there are also non-SQL interfaces available for users to work with DuckDB databases. In the last section of this chapter, A short SQL primer, we’ll briefly introduce the fundamentals of SQL for those who are new to it or a little rusty.
  • In-process means that DuckDB runs embedded within a host process. This is in contrast to most DBMSs, which typically operate standalone, running in a separate process from consuming applications, often on a remote server. By adopting an in-process model rather than a client-server architecture, DuckDB greatly simplifies installation and integration, removing the need to install and manage a standalone DBMS service, as well as the need to connect and authenticate with a remote server. A notable example of an in-process DBMS that you may have encountered is SQLite, which is a popular choice for software developers distributing apps that require reading and writing local transactional data, such as user data for mobile apps and lightweight web apps.
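
To make the in-process and OLTP/OLAP ideas above concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is used only because it ships with Python and is the in-process DBMS named above; the tables, column names, and values are hypothetical and chosen purely for illustration. The sketch shows that no server is started or contacted, and contrasts a single-record lookup (the OLTP pattern) with a join plus column-wise aggregation (the OLAP pattern).

```python
import sqlite3

# Opening a connection starts no server: the engine runs inside this
# Python process. ":memory:" keeps the database in RAM only.
con = sqlite3.connect(":memory:")

# Hypothetical tables and rows, for illustration only.
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (customer_id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'AU'), (2, 'NZ'), (3, 'AU');
    INSERT INTO orders VALUES
        (10, 1, 20.0), (11, 1, 5.0), (12, 2, 12.5), (13, 3, 7.5);
""")

# OLTP-style access: read one record by its key.
single_order = con.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (12,)
).fetchone()

# OLAP-style access: join two tables and aggregate a column over every row.
revenue_by_country = con.execute("""
    SELECT c.country, SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()

print(single_order)
print(revenue_by_country)
con.close()
```

Both queries are valid in SQLite, but the second is the kind of workload DuckDB is designed for. Assuming the separately installed DuckDB Python client, a similar connect-and-execute pattern applies there, starting from duckdb.connect().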

Putting all these pieces together, we can see that DuckDB is a fully featured relational DBMS (RDBMS) that is designed for analytical workloads, provides a SQL interface, and runs entirely embedded in a host process.

When compared with other popular databases, DuckDB is perhaps most similar to the ubiquitous SQLite in that they are both simple in-process DBMSs that write to a single-file storage format, and they are also both free and open source. The key difference between the two tools is that SQLite is optimized for row-oriented OLTP workloads and hence does not perform well on complex analytical workloads, whereas DuckDB is purpose-built for these workloads, offering extremely good performance over them. It’s for this reason that DuckDB is sometimes described as SQLite for OLAP. In fact, DuckDB appears to be the first production-ready in-process OLAP DBMS.
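
One way to build intuition for why a row-oriented engine and an analytics-oriented engine favor different workloads is to compare two in-memory layouts of the same data. The following is a toy sketch in plain Python, not a description of how SQLite or DuckDB actually store data (DuckDB uses a columnar design, but its real storage format is far more sophisticated); the records and field names are hypothetical.

```python
# Row-oriented layout: each record is stored together, which suits
# reading or writing one whole record at a time (the OLTP pattern).
rows = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 5.0},
    {"order_id": 12, "customer_id": 2, "amount": 12.5},
]

# Column-oriented layout: each column's values are stored together, which
# suits scanning one column across many records (the OLAP pattern).
columns = {
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [20.0, 5.0, 12.5],
}

# Fetching one full record is a single lookup in the row layout...
record_from_rows = rows[2]
# ...but needs one lookup per column in the column layout.
record_from_columns = {name: values[2] for name, values in columns.items()}

# Summing one column means visiting every record in the row layout...
total_from_rows = sum(row["amount"] for row in rows)
# ...while the column layout reads just the one list it needs.
total_from_columns = sum(columns["amount"])

print(record_from_rows == record_from_columns)
print(total_from_rows, total_from_columns)
```

Both layouts give the same answers; the difference is how much unrelated data each access has to touch, which becomes significant when aggregating over entire tables.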

In the next section, we’ll explore the reasons why people are increasingly adopting DuckDB and finding it to be a valuable workhorse in their analytical data toolkit.
