Preface
There is no shortage of data being produced by humanity, in myriad formats, shapes, and ever-growing quantities. As it grows, so do the opportunities for leveraging data to benefit our world: improving decision making for governments, companies, and public organizations; supporting scientific research and technological advancements; and enabling the development of consumer products and important public services. To realize these opportunities, we are faced with an imperative: if we want to perform effective data analysis and develop products and services infused with machine learning, we must be able to manage, understand, and effectively work with the data that makes them possible.
Whether you are a data analyst, data scientist, research scientist, data engineer, software engineer, or data hobbyist, you are likely to face many of the same challenges when it comes to working with data. Analytical data workflows and applications require that data be loaded, cleaned, transformed, organized, exported, and crunched into summarized forms. A running joke amongst data practitioners is that they spend more time preparing and wrangling their data, and fighting with the tools that support their work, than they do on the value-producing activities that are likely to be in their job descriptions. As data grows in volume and variety, these challenges become both more difficult and more pressing to solve.
DuckDB is an analytical database that handles many of these challenges with ease. It enables data practitioners to streamline and improve the effectiveness of activities across the entire life cycle of data analysis and the development of analytical data infrastructure. It is simple to install and use on virtually any machine, and it runs entirely in-process, without the overhead of connecting to and maintaining a dedicated server. At the same time, it offers blazing-fast performance for analytical operations, as well as powerful data management capabilities: features that are normally associated with distributed data processing engines and dedicated SQL database management systems. DuckDB’s rich feature set makes it an incredibly versatile tool that is well suited to a range of different use cases, such as performing interactive data analysis and ad hoc data wrangling, efficiently querying data lakes, developing lean pipelines for transforming data, functioning as an operational data warehouse, and forming a low-latency query engine for powering responsive data apps. This versatility can also be a bit overwhelming at first, as it’s hard to compare DuckDB with any one existing tool that you might be familiar with.
In this book, we’ll dive into many of DuckDB’s powerful and flexible capabilities. We’ll give you a clear framework for thinking about what kind of data tool DuckDB is and the types of applications it excels at. Through a range of hands-on examples, you’ll learn how to make the most of this exciting tool and discover the many ways that you can incorporate it into your own analytical workflows and projects.