Python Data Cleaning and Preparation Best Practices

A practical guide to organizing and handling data from various sources and formats using Python

Product type Paperback
Published in Sep 2024
Publisher Packt
ISBN-13 9781837634743
Length 456 pages
Edition 1st Edition
Author Maria Zervou

Table of Contents

Preface
Part 1: Upstream Data Ingestion and Cleaning
Chapter 1: Data Ingestion Techniques
Chapter 2: Importance of Data Quality
Chapter 3: Data Profiling – Understanding Data Structure, Quality, and Distribution
Chapter 4: Cleaning Messy Data and Data Manipulation
Chapter 5: Data Transformation – Merging and Concatenating
Chapter 6: Data Grouping, Aggregation, Filtering, and Applying Functions
Chapter 7: Data Sinks
Part 2: Downstream Data Cleaning – Consuming Structured Data
Chapter 8: Detecting and Handling Missing Values and Outliers
Chapter 9: Normalization and Standardization
Chapter 10: Handling Categorical Features
Chapter 11: Consuming Time Series Data
Part 3: Downstream Data Cleaning – Consuming Unstructured Data
Chapter 12: Text Preprocessing in the Era of LLMs
Chapter 13: Image and Audio Preprocessing with LLMs
Index
Other Books You May Enjoy

What this book covers

Chapter 1, Data Ingestion Techniques, provides a comprehensive overview of the data ingestion process, emphasizing its role in collecting and importing data from various sources into storage systems for analysis. You will explore different ingestion methods such as batch and streaming modes, compare real-time and semi-real-time ingestion, and understand the technologies behind data sources. The chapter highlights the advantages, disadvantages, and practical applications of these methods.

Chapter 2, Importance of Data Quality, emphasizes the critical role data quality plays in business decision-making. It highlights the risks of using inaccurate, inconsistent, or outdated data, which can lead to poor decisions, damaged reputations, and missed opportunities. You will explore why data quality is essential, how to measure it across different dimensions, and the impact of data silos on maintaining data quality.

Chapter 3, Data Profiling – Understanding Data Structure, Quality, and Distribution, explores data profiling and focuses on scrutinizing and validating datasets to understand their structure, patterns, and quality. You will learn how to perform data profiling using tools such as the pandas Profiler and Great Expectations and understand when to use each tool. Additionally, the chapter covers tactics for handling large data volumes and compares profiling methods to improve data validation.

Chapter 4, Cleaning Messy Data and Data Manipulation, focuses on the key strategies for cleaning and manipulating data, enabling efficient and accurate analysis. It covers techniques for renaming columns, removing irrelevant or redundant data, fixing inconsistent data types, and handling date and time formats. By mastering these methods, you will learn how to enhance the quality and reliability of your datasets.

Chapter 5, Data Transformation – Merging and Concatenating, explores techniques for transforming and manipulating data through merging, joining, and concatenating datasets. It covers methods to combine multiple datasets from various sources, handle duplicates effectively, and improve merging performance. The chapter also provides practical tricks to streamline the merging process, ensuring efficient data integration for insightful analysis.

Chapter 6, Data Grouping, Aggregation, Filtering, and Applying Functions, covers the essential techniques of data grouping and aggregation, which are vital for summarizing large datasets and generating meaningful insights. It discusses methods to handle missing or noisy data by aggregating values, reducing data volume, and enhancing processing efficiency. The chapter also focuses on grouping data by various keys, applying aggregate and custom functions, and filtering data to create valuable features for deeper analysis or machine learning (ML).
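
As a rough illustration of the group, aggregate, and filter pattern described for this chapter, here is a minimal sketch using only the Python standard library. It is not taken from the book; the `orders` records, the `region` and `amount` fields, and the helper names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample records; field names are assumptions for illustration.
orders = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 60.0},
    {"region": "south", "amount": 40.0},
    {"region": "west", "amount": 10.0},
]


def group_values(rows, key, value):
    """Collect the `value` field of each row into a list per distinct `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return dict(groups)


def aggregate(groups):
    """Summarize each group with a count, a total, and a mean."""
    return {
        name: {"count": len(vals), "total": sum(vals), "mean": mean(vals)}
        for name, vals in groups.items()
    }


summary = aggregate(group_values(orders, "region", "amount"))

# Filter step: keep only groups whose total reaches an arbitrary threshold.
large_regions = {name: s for name, s in summary.items() if s["total"] >= 100.0}
```

The same three steps map onto a dataframe library's group-by, aggregate, and filter operations; the plain-Python version is used here only to keep the sketch dependency-free.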

Chapter 7, Data Sinks, focuses on the critical decisions involved in data processing, particularly the selection of appropriate data sinks for storage and processing needs. It delves into four essential pillars: choosing the right data sink, selecting the correct file type, optimizing partitioning strategies, and understanding how to design a scalable online retail data platform. The chapter equips you with the tools to enhance efficiency, scalability, and performance in data processing pipelines.

Chapter 8, Detecting and Handling Missing Values and Outliers, delves into techniques for identifying and managing missing values and outliers. It covers a range of methods, from statistical approaches to advanced ML models, to address these issues effectively. The key areas of focus include detecting and handling missing data, identifying univariate and multivariate outliers, and managing outliers in various datasets.
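
To make the statistical end of that range concrete, the sketch below shows one simple way to fill missing values (median imputation) and flag univariate outliers (the interquartile-range rule). It is an illustrative assumption, not the book's code; the `readings` data, the helper names, and the conventional `k=1.5` multiplier are chosen for the example.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first/third quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sensor readings with one gap and one extreme value.
readings = [10.0, None, 12.0, 14.0, 16.0, 18.0, 100.0]

filled = impute_median(readings)
outliers = iqr_outliers(filled)
```

Median imputation is used here because the median is less sensitive to the extreme value than the mean would be; multivariate outliers need different techniques that this sketch does not attempt.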

Chapter 9, Normalization and Standardization, covers essential preprocessing techniques such as feature scaling, normalization, and standardization, which ensure that ML models can effectively learn from data. You will explore different techniques, including scaling features to a range, Z-score scaling, and using a robust scaler, to address various data challenges in ML tasks.
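
The three techniques named above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the book's implementation: the function names and the `incomes` sample are hypothetical, the Z-score uses the population standard deviation, and the robust scaler is taken to mean centering on the median and dividing by the interquartile range.

```python
from statistics import mean, median, pstdev, quantiles


def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale so the minimum maps to `low` and the maximum to `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [low for _ in values]  # constant feature: nothing to spread out
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


def z_score_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in values]
    med = median(values)
    return [(v - med) / iqr for v in values]


# Hypothetical feature with one extreme value.
incomes = [10.0, 20.0, 30.0, 40.0, 1000.0]

scaled_range = min_max_scale(incomes)
scaled_z = z_score_scale(incomes)
scaled_robust = robust_scale(incomes)
```

With an extreme value present, min-max scaling squeezes the other points toward the low end of the range, while the robust variant keeps them evenly spaced because the median and IQR are barely affected by the extreme value.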

Chapter 10, Handling Categorical Features, addresses the importance of managing categorical features, which represent non-numerical information in datasets. You will learn various encoding techniques, including label encoding, one-hot encoding, target encoding, frequency encoding, and binary encoding, to transform categorical data for ML models.
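
As a small illustration of three of the encodings listed above (label, one-hot, and frequency), here is a dependency-free sketch. It is not the book's code: the helper names and the `colors` sample are hypothetical, and assigning label codes in sorted category order is just one possible convention.

```python
from collections import Counter


def label_encode(values):
    """Map each category to an integer code, assigned in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one 0/1 indicator column per category, in sorted order."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


def frequency_encode(values):
    """Replace each category with its relative frequency in the data."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


# Hypothetical categorical feature.
colors = ["red", "green", "blue", "green"]

codes, mapping = label_encode(colors)
one_hot, columns = one_hot_encode(colors)
freqs = frequency_encode(colors)
```

Label encoding imposes an arbitrary order on the categories, which some models may read as meaningful; one-hot encoding avoids that at the cost of one column per category.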

Chapter 11, Consuming Time Series Data, delves into the fundamentals of time series analysis, covering key concepts, methodologies, and applications across various industries. It includes understanding the components and types of time series data, identifying and handling missing values, and techniques for analyzing trends and patterns over time. The chapter also addresses dealing with outliers and feature engineering to enhance predictive modeling with time series data.

Chapter 12, Text Preprocessing in the Era of LLMs, focuses on mastering text preprocessing techniques that are essential for optimizing the performance of large language models (LLMs). It covers methods for cleaning text, handling rare words and spelling variations, chunking, and tokenization strategies. Additionally, it addresses the transformation of tokens into embeddings, highlighting the importance of adapting preprocessing approaches to maximize the potential of LLMs.

Chapter 13, Image and Audio Preprocessing with LLMs, examines preprocessing techniques for unstructured data, particularly images and audio, to extract meaningful information. It includes methods for image preprocessing, such as optical character recognition (OCR) and image caption generation with the BLIP model. The chapter also explores audio data handling, including converting audio to text using the Whisper model, providing a comprehensive overview of working with multimedia data in the context of LLMs.
