Hands-On Exploratory Data Analysis with DuckDB

This article is an excerpt from the book, Getting Started with DuckDB, by Simon Aubury and Ned Letcher. Discover how DuckDB's features can be used to power practical data analysis workflows through real-world examples and SQL recipes.

Introduction

 DuckDB is a versatile and highly optimized database management system designed for efficient data analysis workflows. Its capabilities allow practitioners to scale their data analysis efforts beyond traditional tools, making it an excellent choice for local machine data processing. In this excerpt, we will explore how to use DuckDB for hands-on exploratory data analysis, leveraging Python, Jupyter Notebooks, and Plotly for interactive data visualizations.

Technical Requirements

To follow along with the examples in this guide, you will need the following setup:

  • Python environment
  • Jupyter Notebook
  • DuckDB installed
  • JupySQL library
  • Plotly library

You can find the necessary code examples in the chapter_11 folder in the book’s GitHub repository at [PacktPublishing](https://github.com/PacktPublishing/Getting-Started-with-DuckDB/tree/main/chapter_11).

Obtaining the Dataset

 We will be using a pedestrian counting system dataset from the city of Melbourne, containing hourly pedestrian counts from sensors located in and around the Melbourne Central Business District (CBD). This dataset provides a comprehensive view of pedestrian traffic patterns over several years.

To download the dataset, visit the dataset’s home page [Melbourne Pedestrian Counting System](https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour) and locate the ZIP file containing the 2009 to 2022 archive.
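
Once the archive has downloaded, extract it and take a quick look at the contents. The snippet below is a minimal sketch rather than code from the book: the archive file name and folder layout are assumptions, so adjust the paths and glob pattern to match your download.

import zipfile
import duckdb

# Extract the downloaded archive (the file name here is an assumption;
# use the name of the ZIP file you actually downloaded)
with zipfile.ZipFile("pedestrian-counting-system-2009-to-2022.zip") as archive:
    archive.extractall("pedestrian_counts_archive")

# Preview the extracted CSV files; DuckDB's read_csv_auto accepts glob patterns
preview = duckdb.sql("""
    SELECT *
    FROM read_csv_auto('pedestrian_counts_archive/**/*.csv')
    LIMIT 5
""").df()
print(preview)

A quick preview like this is a convenient way to confirm the column names and types before loading the data properly in the following sections.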

Setting Up the Environment

 Before diving into the code, ensure your Python environment is set up with the necessary dependencies. You will need to:

 1. Set up a Python virtual environment:

python -m venv duckdb_env
source duckdb_env/bin/activate

2. Install the required libraries:

pip install jupyter duckdb plotly jupysql pandas

3. Start Jupyter Notebook:

 jupyter notebook

Loading and Cleaning Data

First, we will load our dataset from a CSV file and perform some data cleaning steps before writing it to a DuckDB database.

Loading CSV Data into DuckDB

import duckdb
import pandas as pd
 
# Load the downloaded CSV file into a pandas DataFrame
csv_path = "path_to_downloaded_zip_file/2022/2022.csv"
pedestrian_counts = pd.read_csv(csv_path)
 
# Display the first few rows of the dataframe
print(pedestrian_counts.head())
 
# Create a DuckDB connection and write the DataFrame to a DuckDB table.
# DuckDB's replacement scan lets the query refer to the in-scope pandas
# DataFrame (pedestrian_counts) by name.
con = duckdb.connect(database=':memory:')
con.execute("CREATE TABLE pedestrian_counts AS SELECT * FROM pedestrian_counts")

Data Cleaning Steps

Perform necessary data cleaning operations such as handling missing values, correcting data types, and filtering irrelevant records.

# Convert the 'Date_Time' column to datetime format
pedestrian_counts['Date_Time'] = pd.to_datetime(pedestrian_counts['Date_Time'])
 
# Handle missing values by filling them with 0
pedestrian_counts = pedestrian_counts.fillna(0)
 
# Write the cleaned data to DuckDB
con.execute("DROP TABLE pedestrian_counts")
con.execute("CREATE TABLE pedestrian_counts AS SELECT * FROM pedestrian_counts")
 
# Verify the cleaned data
result = con.execute("SELECT * FROM pedestrian_counts LIMIT 5").fetchdf()
print(result)
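
Depending on your analysis, you may also want to filter out irrelevant records as part of the cleaning step. The snippet below is an illustrative sketch only; the filter conditions are assumptions rather than rules from the dataset or the book.

# Keep a filtered copy of the table (illustrative filters: drop negative counts
# and restrict the data to 2019 onwards)
con.execute("""
    CREATE OR REPLACE TABLE pedestrian_counts_filtered AS
    SELECT *
    FROM pedestrian_counts
    WHERE Counts >= 0
      AND Date_Time >= TIMESTAMP '2019-01-01'
""")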

Using JupySQL for SQL Queries

JupySQL is a powerful library that allows you to run SQL queries directly in Jupyter Notebooks. This makes it easy to interact with your DuckDB database without switching contexts.

Example JupySQL Query

%load_ext sql
%sql duckdb:///:memory:

# Note: duckdb:///:memory: opens a fresh in-memory database, separate from the
# Python `con` connection used above, so the pedestrian_counts table must exist
# in this database before it can be queried here.

# In a separate notebook cell (%%sql must be the first line of its cell):
%%sql
-- View the first few rows of the dataset
SELECT * FROM pedestrian_counts LIMIT 5;
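
JupySQL can also hand query results back to Python. As a rough sketch, assuming JupySQL's standard autopandas option and << result-capture syntax, a query result can be pulled into a pandas DataFrame for further processing:

# In one cell: return query results as pandas DataFrames automatically
%config SqlMagic.autopandas = True

# In a separate cell: capture the query result in a variable named daily_sample
%%sql daily_sample <<
SELECT * FROM pedestrian_counts LIMIT 100;

The captured daily_sample variable can then be used like any other DataFrame in the notebook.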

Visualizing Data with Plotly

 Plotly is a versatile data visualization library that integrates well with Jupyter Notebooks. We will use it to create interactive visualizations of our dataset.

Total Pedestrian Counts Over Time

import plotly.express as px
 
# Aggregate pedestrian counts by year
# (note: DuckDB's strftime takes the timestamp first and the format string second)
yearly_counts = con.execute("""
    SELECT strftime(Date_Time, '%Y') AS Year, SUM(Counts) AS Total_Counts
    FROM pedestrian_counts
    GROUP BY Year
    ORDER BY Year
""").fetchdf()
 
# Create a bar chart
fig = px.bar(yearly_counts, x='Year', y='Total_Counts', title='Total Pedestrian Counts by Year')
fig.show()

Monthly Traffic Counts

# Aggregate pedestrian counts by month for the years 2019 and 2020
monthly_counts = con.execute("""
    SELECT strftime(Date_Time, '%Y-%m') AS Month, SUM(Counts) AS Monthly_Counts
    FROM pedestrian_counts
    WHERE strftime(Date_Time, '%Y') IN ('2019', '2020')
    GROUP BY Month
    ORDER BY Month
""").fetchdf()
 
# Create a line chart to compare the two years
fig = px.line(monthly_counts, x='Month', y='Monthly_Counts', title='Monthly Pedestrian Counts for 2019 and 2020')
fig.show()

Hourly Traffic Patterns

# Aggregate pedestrian counts by hour of the day
hourly_counts = con.execute("""
    SELECT strftime(Date_Time, '%H') AS Hour, AVG(Counts) AS Average_Counts
    FROM pedestrian_counts
    GROUP BY Hour
    ORDER BY Hour
""").fetchdf()
 
# Create a line chart for hourly patterns
fig = px.line(hourly_counts, x='Hour', y='Average_Counts', title='Average Hourly Pedestrian Counts')
fig.show()

Exploratory Data Analysis

 With our dataset loaded and visualized, we can perform a more detailed exploratory data analysis.

Comparing Traffic on Weekdays vs. Weekends

# Aggregate pedestrian counts by day of the week, deriving the day name directly
# in SQL with DuckDB's dayname() and ordering Monday-Sunday with isodow()
# (MySQL's FIELD() function is not available in DuckDB)
daily_counts = con.execute("""
    SELECT dayname(Date_Time) AS Day_of_Week,
           isodow(Date_Time) AS Day_Number,
           AVG(Counts) AS Average_Counts
    FROM pedestrian_counts
    GROUP BY Day_of_Week, Day_Number
    ORDER BY Day_Number
""").fetchdf()
 
# Create a bar chart for daily patterns
fig = px.bar(daily_counts, x='Day_of_Week', y='Average_Counts', title='Average Pedestrian Counts by Day of the Week')
fig.show()
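
To make the weekday versus weekend contrast explicit, the days can also be bucketed into two groups. The grouping logic below is our own sketch using DuckDB's isodow() function (Monday = 1 through Sunday = 7), not code from the book:

# Compare average counts for weekdays vs. weekends
weekend_counts = con.execute("""
    SELECT CASE WHEN isodow(Date_Time) IN (6, 7) THEN 'Weekend' ELSE 'Weekday' END AS Day_Type,
           AVG(Counts) AS Average_Counts
    FROM pedestrian_counts
    GROUP BY Day_Type
""").fetchdf()

# Create a bar chart comparing the two groups
fig = px.bar(weekend_counts, x='Day_Type', y='Average_Counts',
             title='Average Pedestrian Counts: Weekdays vs. Weekends')
fig.show()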

Peak Hours of Pedestrian Traffic

# Identify peak hours by finding the hours with the highest average counts
peak_hours = con.execute("""
    SELECT strftime('%H', Date_Time) AS Hour, AVG(Counts) AS Average_Counts
    FROM pedestrian_counts
    GROUP BY Hour
    ORDER BY Average_Counts DESC
    LIMIT 5
""").fetchdf()
 
# Create a bar chart for peak hours
fig = px.bar(peak_hours, x='Hour', y='Average_Counts', title='Peak Hours of Pedestrian Traffic')
fig.show()

Conclusion

DuckDB, combined with JupySQL and Plotly, provides a robust framework for performing hands-on exploratory data analysis. By leveraging DuckDB’s high-performance SQL capabilities and integrating with powerful visualization tools, you can efficiently uncover insights from your data. We encourage you to further explore DuckDB’s features and apply these techniques to your datasets.

For a deeper dive into DuckDB's powerful data analysis capabilities and to explore more advanced topics, we highly recommend reading the book 'Getting Started with DuckDB' by Simon Aubury and Ned Letcher.

Author Bio

Simon Aubury has been working in the IT industry since 2000 as a data engineering specialist. He has an extensive background in building large, flexible, highly available distributed data systems. Simon has delivered critical data systems for finance, transport, healthcare, insurance, and telecommunications clients in Australia, Europe, and Asia Pacific. In 2019, Simon joined ThoughtWorks as a principal data engineer and today is associate director of data platforms at Simple Machines in Sydney, Australia. Simon is active in the data community, a regular conference speaker, and the organizer of local and international meetups and data engineering conferences.

Ned Letcher has worked as a data science and software engineering consultant since completing his PhD in computational linguistics in 2018 and currently works at Thoughtworks. He has designed and developed data-powered products and services across a range of industries and helped organizations and teams improve the effectiveness of their data processes and workflows. Ned has also worked as a Python trainer, supporting both tertiary students and data professionals across various organizations. He is active in the data community, speaking at and helping organize meetups and conferences, as well as contributing to a range of open source projects.