This article is an excerpt from the book, Getting Started with DuckDB, by Simon Aubury and Ned Letcher. Discover how DuckDB's fast, in-process analytical engine can power practical data analysis workflows through real-world examples and SQL recipes.
DuckDB is a versatile and highly optimized database management system designed for efficient data analysis workflows. Its capabilities allow practitioners to scale their data analysis efforts beyond traditional tools, making it an excellent choice for local machine data processing. In this excerpt, we will explore how to use DuckDB for hands-on exploratory data analysis, leveraging Python, Jupyter Notebooks, and Plotly for interactive data visualizations.
To follow along with the examples in this guide, you will need a working Python environment with Jupyter installed; the setup steps are covered below.
You can find the necessary code examples in the chapter_11 folder of the book's GitHub repository at [PacktPublishing](https://github.com/PacktPublishing/Getting-Started-with-DuckDB/tree/main/chapter_11).
We will be using a pedestrian counting system dataset from the city of Melbourne, containing hourly pedestrian counts from sensors located in and around the Melbourne Central Business District (CBD). This dataset provides a comprehensive view of pedestrian traffic patterns over several years.
To download the dataset, visit the dataset’s home page [Melbourne Pedestrian Counting System](https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour) and locate the ZIP file containing the 2009 to 2022 archive.
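Once downloaded, extract the archive somewhere convenient. Here is a minimal sketch using Python's standard zipfile module; the ZIP filename below is a placeholder, so substitute whatever the portal gives you:

```python
import zipfile

# Extract the downloaded archive into a local data/ directory
# (the filename here is a placeholder for the actual download)
with zipfile.ZipFile("pedestrian-counting-system-2009-to-2022.zip") as zf:
    zf.extractall("data/")
```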
Before diving into the code, ensure your Python environment is set up with the necessary dependencies:

```bash
python -m venv duckdb_env
source duckdb_env/bin/activate
pip install jupyter duckdb plotly jupysql pandas
jupyter notebook
```
First, we will load our dataset from a CSV file and perform some data cleaning steps before writing it to a DuckDB database.
```python
import duckdb
import pandas as pd

# Load the dataset into a pandas DataFrame
data_url = "path_to_downloaded_zip_file/2022/2022.csv"
pedestrian_counts = pd.read_csv(data_url)

# Display the first few rows of the DataFrame
print(pedestrian_counts.head())

# Create a DuckDB connection and write the DataFrame to a DuckDB table.
# DuckDB's replacement scans let the SQL reference the pedestrian_counts
# DataFrame by name, since no table of that name exists yet.
con = duckdb.connect(database=':memory:')
con.execute("CREATE TABLE pedestrian_counts AS SELECT * FROM pedestrian_counts")
```
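As an aside, DuckDB can also ingest the CSV directly and skip the pandas round trip entirely. A minimal sketch using DuckDB's read_csv_auto function, which infers column names and types from the file (adjust the path to wherever you extracted the data):

```python
import duckdb

con = duckdb.connect(database=':memory:')
# read_csv_auto sniffs the header, delimiter, and column types
con.execute("""
    CREATE TABLE pedestrian_counts AS
    SELECT * FROM read_csv_auto('path_to_downloaded_zip_file/2022/2022.csv')
""")
```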
Next, perform some basic data cleaning: parse the Date_Time column into proper timestamps, fill missing counts, and then rewrite the DuckDB table from the cleaned DataFrame.
```python
# Convert the 'Date_Time' column to datetime format
pedestrian_counts['Date_Time'] = pd.to_datetime(pedestrian_counts['Date_Time'])

# Handle missing values by filling them with 0
pedestrian_counts = pedestrian_counts.fillna(0)

# Replace the table with the cleaned data; after the DROP, the table name
# once again resolves to the (now cleaned) DataFrame
con.execute("DROP TABLE pedestrian_counts")
con.execute("CREATE TABLE pedestrian_counts AS SELECT * FROM pedestrian_counts")

# Verify the cleaned data
result = con.execute("SELECT * FROM pedestrian_counts LIMIT 5").fetchdf()
print(result)
```
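For a quick sanity check of the cleaned table, DuckDB's SUMMARIZE command reports per-column statistics such as min, max, distinct counts, and null percentage in a single call:

```python
# Per-column summary statistics for the cleaned table
print(con.execute("SUMMARIZE pedestrian_counts").fetchdf())
```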
JupySQL is a powerful library that allows you to run SQL queries directly in Jupyter Notebooks. This makes it easy to interact with your DuckDB database without switching contexts.
#### Example JupySQL Query
```python
%load_ext sql
# Pass the existing DuckDB connection to JupySQL so it can see the
# pedestrian_counts table; connecting to duckdb:///:memory: instead
# would open a second, empty in-memory database
%sql con
```

```sql
%%sql
-- View the first few rows of the dataset
SELECT * FROM pedestrian_counts LIMIT 5;
```
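JupySQL can also hand query results back to Python. The << operator captures a cell's result under a variable name, which can then be converted to a pandas DataFrame; a small sketch (the variable name sensor_sample is arbitrary):

```sql
%%sql sensor_sample <<
SELECT * FROM pedestrian_counts LIMIT 5;
```

```python
# Convert the captured result set to a pandas DataFrame
df = sensor_sample.DataFrame()
print(df)
```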
Plotly is a versatile data visualization library that integrates well with Jupyter Notebooks. We will use it to create interactive visualizations of our dataset.
```python
import plotly.express as px

# Aggregate pedestrian counts by year (note that DuckDB's strftime takes
# the timestamp first, then the format string)
yearly_counts = con.execute("""
    SELECT strftime(Date_Time, '%Y') AS Year, SUM(Counts) AS Total_Counts
    FROM pedestrian_counts
    GROUP BY Year
    ORDER BY Year
""").fetchdf()

# Create a bar chart
fig = px.bar(yearly_counts, x='Year', y='Total_Counts', title='Total Pedestrian Counts by Year')
fig.show()
```
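With the yearly totals in hand, a natural follow-up question is which locations drive the traffic. Assuming the dataset includes a Sensor_Name column, as the published Melbourne dataset does, a quick sketch:

```python
# Total counts per sensor, keeping the ten busiest locations
top_sensors = con.execute("""
    SELECT Sensor_Name, SUM(Counts) AS Total_Counts
    FROM pedestrian_counts
    GROUP BY Sensor_Name
    ORDER BY Total_Counts DESC
    LIMIT 10
""").fetchdf()

fig = px.bar(top_sensors, x='Sensor_Name', y='Total_Counts', title='Ten Busiest Sensors')
fig.show()
```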
```python
# Aggregate pedestrian counts by month for the years 2019 and 2020
monthly_counts = con.execute("""
    SELECT strftime(Date_Time, '%Y-%m') AS Month, SUM(Counts) AS Monthly_Counts
    FROM pedestrian_counts
    WHERE strftime(Date_Time, '%Y') IN ('2019', '2020')
    GROUP BY Month
    ORDER BY Month
""").fetchdf()

# Create a line chart to compare the two years
fig = px.line(monthly_counts, x='Month', y='Monthly_Counts', title='Monthly Pedestrian Counts for 2019 and 2020')
fig.show()
```
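To make the 2019-versus-2020 comparison easier to read, the two years can be overlaid rather than plotted end to end. One way, sketched below rather than taken from the book, is to extract the year and month separately and let Plotly colour the lines by year:

```python
# Extract year and month separately so the two years can be overlaid
overlay = con.execute("""
    SELECT strftime(Date_Time, '%Y') AS Year,
           strftime(Date_Time, '%m') AS Month,
           SUM(Counts) AS Monthly_Counts
    FROM pedestrian_counts
    WHERE strftime(Date_Time, '%Y') IN ('2019', '2020')
    GROUP BY Year, Month
    ORDER BY Year, Month
""").fetchdf()

fig = px.line(overlay, x='Month', y='Monthly_Counts', color='Year',
              title='Monthly Pedestrian Counts: 2019 vs 2020')
fig.show()
```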
```python
# Aggregate pedestrian counts by hour of the day
hourly_counts = con.execute("""
    SELECT strftime(Date_Time, '%H') AS Hour, AVG(Counts) AS Average_Counts
    FROM pedestrian_counts
    GROUP BY Hour
    ORDER BY Hour
""").fetchdf()

# Create a line chart for hourly patterns
fig = px.line(hourly_counts, x='Hour', y='Average_Counts', title='Average Hourly Pedestrian Counts')
fig.show()
```
With our dataset loaded and visualized, we can perform a more detailed exploratory data analysis.
```python
# Aggregate pedestrian counts by day of the week. DuckDB's dayname()
# returns the weekday name directly, so there is no need to add a column
# to the DataFrame first, and isodow() (Monday = 1 ... Sunday = 7) gives
# a natural Monday-first ordering (FIELD() is MySQL syntax, not DuckDB)
daily_counts = con.execute("""
    SELECT dayname(Date_Time) AS Day_of_Week, AVG(Counts) AS Average_Counts
    FROM pedestrian_counts
    GROUP BY Day_of_Week, isodow(Date_Time)
    ORDER BY isodow(Date_Time)
""").fetchdf()

# Create a bar chart for daily patterns
fig = px.bar(daily_counts, x='Day_of_Week', y='Average_Counts', title='Average Pedestrian Counts by Day of the Week')
fig.show()
```
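Weekday and hour effects can also be examined together. A heatmap of average counts for each weekday/hour combination makes commute peaks and weekend patterns visible at a glance; a sketch using DuckDB's dayname() and hour() functions:

```python
# Average counts for every weekday/hour combination
heat = con.execute("""
    SELECT dayname(Date_Time) AS Day_of_Week,
           hour(Date_Time) AS Hour,
           AVG(Counts) AS Average_Counts
    FROM pedestrian_counts
    GROUP BY Day_of_Week, Hour
""").fetchdf()

fig = px.density_heatmap(heat, x='Hour', y='Day_of_Week', z='Average_Counts',
                         histfunc='avg', title='Average Counts by Weekday and Hour')
fig.show()
```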
```python
# Identify peak hours by finding the hours with the highest average counts
peak_hours = con.execute("""
    SELECT strftime(Date_Time, '%H') AS Hour, AVG(Counts) AS Average_Counts
    FROM pedestrian_counts
    GROUP BY Hour
    ORDER BY Average_Counts DESC
    LIMIT 5
""").fetchdf()

# Create a bar chart for peak hours
fig = px.bar(peak_hours, x='Hour', y='Average_Counts', title='Peak Hours of Pedestrian Traffic')
fig.show()
```
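One last practical note: everything above used an in-memory database, which disappears when the notebook stops. To make the analysis resumable, connect to a file-backed database instead and DuckDB will persist the tables automatically. A minimal sketch; the filename is arbitrary:

```python
import duckdb

# A file-backed connection persists its tables across sessions
file_con = duckdb.connect(database='pedestrian.duckdb')
# The name pedestrian_counts resolves to the cleaned pandas DataFrame
# in the notebook's scope via DuckDB's replacement scans
file_con.execute("CREATE TABLE pedestrian_counts AS SELECT * FROM pedestrian_counts")
```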
DuckDB, combined with JupySQL and Plotly, provides a robust framework for performing hands-on exploratory data analysis. By leveraging DuckDB’s high-performance SQL capabilities and integrating with powerful visualization tools, you can efficiently uncover insights from your data. We encourage you to further explore DuckDB’s features and apply these techniques to your datasets.
For a deeper dive into DuckDB's powerful data analysis capabilities and to explore more advanced topics, we highly recommend reading the book 'Getting Started with DuckDB' by Simon Aubury and Ned Letcher.
Simon Aubury has been working in the IT industry since 2000 as a data engineering specialist. He has an extensive background in building large, flexible, highly available distributed data systems. Simon has delivered critical data systems for finance, transport, healthcare, insurance, and telecommunications clients in Australia, Europe, and Asia Pacific. In 2019, Simon joined ThoughtWorks as a principal data engineer and today is associate director of data platforms at Simple Machines in Sydney, Australia. Simon is active in the data community, a regular conference speaker, and the organizer of local and international meetups and data engineering conferences.
Ned Letcher has worked as a data science and software engineering consultant since completing his PhD in computational linguistics in 2018 and currently works at Thoughtworks. He has designed and developed data-powered products and services across a range of industries and helped organizations and teams improve the effectiveness of their data processes and workflows. Ned has also worked as a Python trainer, supporting both tertiary students and data professionals across various organizations. He is active in the data community, speaking at and helping organize meetups and conferences, as well as contributing to a range of open source projects.