Python Data Cleaning and Preparation Best Practices: A practical guide to organizing and handling data from various sources and formats using Python

Product type Paperback

Published in Sep 2024

Publisher Packt

ISBN-13 9781837634743

Length 456 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Author (1):

Maria Zervou

Table of Contents

Preface

1. Part 1: Upstream Data Ingestion and Cleaning

2. Chapter 1: Data Ingestion Techniques

3. Chapter 2: Importance of Data Quality

4. Chapter 3: Data Profiling – Understanding Data Structure, Quality, and Distribution

5. Chapter 4: Cleaning Messy Data and Data Manipulation

6. Chapter 5: Data Transformation – Merging and Concatenating

7. Chapter 6: Data Grouping, Aggregation, Filtering, and Applying Functions

8. Chapter 7: Data Sinks

9. Part 2: Downstream Data Cleaning – Consuming Structured Data

10. Chapter 8: Detecting and Handling Missing Values and Outliers

11. Chapter 9: Normalization and Standardization

12. Chapter 10: Handling Categorical Features

13. Chapter 11: Consuming Time Series Data

14. Part 3: Downstream Data Cleaning – Consuming Unstructured Data

15. Chapter 12: Text Preprocessing in the Era of LLMs

16. Chapter 13: Image and Audio Preprocessing with LLMs

17. Index

18. Other Books You May Enjoy

Preface

In today’s fast-paced data-driven world, it’s easy to be dazzled by the headlines about artificial intelligence (AI) breakthroughs and advanced machine learning (ML) models. But ask any seasoned data scientist or engineer, and they’ll tell you the same thing: the true foundation of any successful data project is not the flashy algorithms or sophisticated models—it’s the data itself, and more importantly, how that data is prepared.

Throughout my career, I have learned that data preprocessing is the unsung hero of data science. It’s the meticulous, often complex process that turns raw data into a reliable asset, ready for analysis, modeling, and ultimately, decision-making. I’ve seen firsthand how the right preprocessing techniques can transform an organization’s approach to data, turning potential challenges into powerful opportunities.

Yet, despite its importance, data preprocessing is often overlooked or undervalued. Many see it as a tedious step, a bottleneck that slows down the exciting work of building models and delivering insights. But I’ve always believed that this phase is where the most critical work happens. After all, even the most sophisticated algorithms can’t make up for poor-quality data. That’s why I’ve dedicated much of my professional journey to mastering this art—exploring the best tools, techniques, and strategies to make preprocessing more efficient, scalable, and aligned with the ever-evolving landscape of AI.

This book aims to demystify data preprocessing, offering both a solid grounding in traditional methods and a forward-looking perspective on emerging techniques. We’ll explore how Python can be used to clean, transform, and organize data more effectively. We’ll also look at how the advent of large language models (LLMs) is redefining what’s possible in this space. These models are already proving to be game changers, automating tasks that were once manual and time-consuming, and providing new ways to enhance data quality and usability.

Throughout these pages, I’ll share insights from my experience, the challenges I faced, and the lessons I learned along the way. My hope is to provide you with not just a technical roadmap but also a deeper understanding of the strategic importance of data preprocessing in today’s data ecosystem. I strongly believe in the philosophy of “learning by doing,” so this book includes a wealth of code examples for you to follow along with. I encourage you to try these examples, experiment with the code, and challenge yourself to apply the techniques to your own datasets.

By the end of this book, you’ll be equipped with the knowledge and skills to approach data preprocessing not just as a necessary step but also as a critical component of your overall data strategy.

So, whether you’re a data scientist, engineer, analyst, or simply someone looking to enhance their understanding of data processes, I invite you to join me on this journey. Together, we will explore how to harness the power of data preprocessing to unlock the full potential of your data.