Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon

How-To Tutorials - DuckDB

1 Articles
article-image-hands-on-exploratory-data-analysis-with-duckdb
Ned Letcher
28 Jun 2024
7 min read
Save for later

Hands-On Exploratory Data Analysis with DuckDB

Ned Letcher
28 Jun 2024
7 min read
This article is an excerpt from the book, Getting Started with DuckDB, by Simon Aubury and Ned Letcher. Discover how Snowflake's unique objects and features can be used to leverage universal modeling techniques through real-world examples and SQL recipes.Introduction DuckDB is a versatile and highly optimized database management system designed for efficient data analysis workflows. Its capabilities allow practitioners to scale their data analysis efforts beyond traditional tools, making it an excellent choice for local machine data processing. In this excerpt, we will explore how to use DuckDB for hands-on exploratory data analysis, leveraging Python, Jupyter Notebooks, and Plotly for interactive data visualizations.Technical RequirementsTo follow along with the examples in this guide, you will need the following setup:Python environmentJupyter NotebookDuckDB installedJupySQL libraryPlotly libraryYou can find the necessary code examples in the chapter_11 folder in the book’s GitHub repository at [PacktPublishing](https://github.com/PacktPublishing/Getting-Started-with-DuckDB/tree/main/chapter_11).Obtaining the Dataset We will be using a pedestrian counting system dataset from the city of Melbourne, containing hourly pedestrian counts from sensors located in and around the Melbourne Central Business District (CBD). This dataset provides a comprehensive view of pedestrian traffic patterns over several years.To download the dataset, visit the dataset’s home page [Melbourne Pedestrian Counting System](https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour) and locate the ZIP file containing the 2009 to 2022 archive.Setting Up the Environment Before diving into the code, ensure your Python environment is set up with the necessary dependencies. You will need to: 1. Set up a Python virtual environment:python -m venv duckdb_env source duckdb_env/bin/activate 2. Install the required libraries:   pip install jupyter duckdb plotly jupysql pandas  3. Start Jupyter Notebook: jupyter notebook Loading and Cleaning DataFirst, we will load our dataset from a CSV file and perform some data cleaning steps before writing it to a DuckDB database.Loading CSV Data into DuckDBimport duckdb import pandas as pd # Load the dataset into a pandas DataFrame data_url = "path_to_downloaded_zip_file/2022/2022.csv" pedestrian_counts = pd.read_csv(data_url) # Display the first few rows of the dataframe print(pedestrian_counts.head()) # Create a DuckDB connection and write the DataFrame to a DuckDB table con = duckdb.connect(database=':memory:') con.execute("CREATE TABLE pedestrian_counts AS SELECT * FROM pedestrian_counts") ```Data Cleaning StepsPerform necessary data cleaning operations such as handling missing values, correcting data types, and filtering irrelevant records.# Convert the 'Date_Time' column to datetime format pedestrian_counts['Date_Time'] = pd.to_datetime(pedestrian_counts['Date_Time']) # Handle missing values by filling them with 0 pedestrian_counts = pedestrian_counts.fillna(0) # Write the cleaned data to DuckDB con.execute("DROP TABLE pedestrian_counts") con.execute("CREATE TABLE pedestrian_counts AS SELECT * FROM pedestrian_counts") # Verify the cleaned data result = con.execute("SELECT * FROM pedestrian_counts LIMIT 5").fetchdf() print(result)Using JupySQL for SQL QueriesJupySQL is a powerful library that allows you to run SQL queries directly in Jupyter Notebooks. This makes it easy to interact with your DuckDB database without switching contexts. #### Example JupySQL Query%load_ext sql %sql duckdb:///:memory: # Query to view the first few rows of the dataset %%sql SELECT * FROM pedestrian_counts LIMIT 5;Visualizing Data with Plotly Plotly is a versatile data visualization library that integrates well with Jupyter Notebooks. We will use it to create interactive visualizations of our dataset.Total Pedestrian Counts Over Timeimport plotly.express as px # Aggregate pedestrian counts by year yearly_counts = con.execute("""    SELECT strftime('%Y', Date_Time) AS Year, SUM(Counts) AS Total_Counts    FROM pedestrian_counts    GROUP BY Year    ORDER BY Year """).fetchdf() # Create a bar chart fig = px.bar(yearly_counts, x='Year', y='Total_Counts', title='Total Pedestrian Counts by Year') fig.show()Monthly Traffic Counts# Aggregate pedestrian counts by month for the years 2019 and 2020 monthly_counts = con.execute("""    SELECT strftime('%Y-%m', Date_Time) AS Month, SUM(Counts) AS Monthly_Counts    FROM pedestrian_counts    WHERE strftime('%Y', Date_Time) IN ('2019', '2020')    GROUP BY Month    ORDER BY Month """).fetchdf() # Create a line chart to compare the two years fig = px.line(monthly_counts, x='Month', y='Monthly_Counts', title='Monthly Pedestrian Counts for 2019 and 2020') fig.show()Hourly Traffic Patterns# Aggregate pedestrian counts by hour of the day hourly_counts = con.execute("""    SELECT strftime('%H', Date_Time) AS Hour, AVG(Counts) AS Average_Counts    FROM pedestrian_counts    GROUP BY Hour    ORDER BY Hour """).fetchdf() # Create a line chart for hourly patterns fig = px.line(hourly_counts, x='Hour', y='Average_Counts', title='Average Hourly Pedestrian Counts') fig.show()Exploratory Data Analysis With our dataset loaded and visualized, we can perform a more detailed exploratory data analysis.Comparing Traffic on Weekdays vs. Weekends# Add a column for day of the week pedestrian_counts['Day_of_Week'] = pedestrian_counts['Date_Time'].dt.day_name() # Aggregate pedestrian counts by day of the week daily_counts = con.execute("""    SELECT Day_of_Week, AVG(Counts) AS Average_Counts    FROM pedestrian_counts    GROUP BY Day_of_Week    ORDER BY FIELD(Day_of_Week, 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday') """).fetchdf() # Create a bar chart for daily patterns fig = px.bar(daily_counts, x='Day_of_Week', y='Average_Counts', title='Average Pedestrian Counts by Day of the Week') fig.show()Peak Hours of Pedestrian Traffic# Identify peak hours by finding the hours with the highest average counts peak_hours = con.execute("""    SELECT strftime('%H', Date_Time) AS Hour, AVG(Counts) AS Average_Counts    FROM pedestrian_counts    GROUP BY Hour    ORDER BY Average_Counts DESC    LIMIT 5 """).fetchdf() # Create a bar chart for peak hours fig = px.bar(peak_hours, x='Hour', y='Average_Counts', title='Peak Hours of Pedestrian Traffic') fig.show()ConclusionDuckDB, combined with JupySQL and Plotly, provides a robust framework for performing hands-on exploratory data analysis. By leveraging DuckDB’s high-performance SQL capabilities and integrating with powerful visualization tools, you can efficiently uncover insights from your data. We encourage you to further explore DuckDB’s features and apply these techniques to your datasets.For a deeper dive into DuckDB's powerful data analysis capabilities and to explore more advanced topics, we highly recommend reading the book 'Getting Started with DuckDB' by Simon Aubury and Ned Letcher."Author BioSimon Aubury has been working in the IT industry since 2000 as a data engineering specialist. He has an extensive background in building large, flexible, highly available distributed data systems. Simon has delivered critical data systems for finance, transport, healthcare, insurance, and telecommunications clients in Australia, Europe, and Asia Pacific. In 2019, Simon joined ThoughtWorks as a principal data engineer and today is associate director of data platforms at Simple Machines in Sydney, Australia. Simon is active in the data community, a regular conference speaker, and the organizer of local and international meetups and data engineering conferences.Ned Letcher has worked as a data science and software engineering consultant since completing his PhD in computational linguistics in 2018 and currently works at Thoughtworks. He has designed and developed data-powered products and services across a range of industries and helped organizations and teams improve the effectiveness of their data processes and workflows. Ned has also worked as a Python trainer, supporting both tertiary students and data professionals across various organizations. He is active in the data community, speaking at and helping organize meetups and conferences, as well as contributing to a range of open source projects.
Read more
  • 0
  • 0
  • 424
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at €18.99/month. Cancel anytime