Python Data Cleaning and Preparation Best Practices: A practical guide to organizing and handling data from various sources and formats using Python

Product type Paperback

Published in Sep 2024

Publisher Packt

ISBN-13 9781837634743

Length 456 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Author (1):

Maria Zervou

Table of Contents (19) Chapters

Preface

1. Part 1: Upstream Data Ingestion and Cleaning

2. Chapter 1: Data Ingestion Techniques

3. Chapter 2: Importance of Data Quality

4. Chapter 3: Data Profiling – Understanding Data Structure, Quality, and Distribution

5. Chapter 4: Cleaning Messy Data and Data Manipulation

6. Chapter 5: Data Transformation – Merging and Concatenating

7. Chapter 6: Data Grouping, Aggregation, Filtering, and Applying Functions

8. Chapter 7: Data Sinks

9. Part 2: Downstream Data Cleaning – Consuming Structured Data

10. Chapter 8: Detecting and Handling Missing Values and Outliers

11. Chapter 9: Normalization and Standardization

12. Chapter 10: Handling Categorical Features

13. Chapter 11: Consuming Time Series Data

14. Part 3: Downstream Data Cleaning – Consuming Unstructured Data

15. Chapter 12: Text Preprocessing in the Era of LLMs

16. Chapter 13: Image and Audio Preprocessing with LLMs

17. Index

18. Other Books You May Enjoy

Handling duplicates when merging datasets

Handling duplicate keys before performing merge operations is crucial because duplicates can lead to unexpected results, such as Cartesian products, where rows are multiplied by the number of matching entries. This can not only distort the data analysis but also significantly impact performance due to the increased size of the resulting DataFrame.
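To make the row multiplication concrete, here is a minimal pandas sketch. The `orders` and `addresses` tables, their column names, and their values are hypothetical and chosen only for illustration. The sketch shows the row count growing after a merge on a duplicated key, then two possible ways to catch the problem beforehand: `DataFrame.duplicated` and the `validate` argument of `merge`.

```python
import pandas as pd

# Hypothetical sample data: customer_id 1 appears twice in each table.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_total": [20.0, 35.0, 15.0],
})
addresses = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "city": ["Athens", "Patras", "London"],
})

# Each of the 2 order rows for customer 1 pairs with each of the 2 address
# rows, so that key contributes 2 * 2 = 4 rows; customer 2 contributes 1 row.
merged = orders.merge(addresses, on="customer_id", how="inner")
print(len(orders), len(addresses), len(merged))  # 3 3 5

# Flag every row whose key occurs more than once before merging.
duplicate_keys = addresses[addresses.duplicated("customer_id", keep=False)]

# validate= asks pandas to raise MergeError if the stated key relationship
# (here: keys unique in the right-hand table) does not hold.
try:
    orders.merge(addresses, on="customer_id", validate="many_to_one")
    validation_failed = False
except pd.errors.MergeError:
    validation_failed = True
```

Whether the duplicated rows should then be dropped, aggregated, or kept depends on what they mean in the data. The check only makes the many-to-many relationship visible before it inflates the result.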

Why handle duplication in rows and columns?

Duplicate keys can lead to a range of problems that may compromise the accuracy of your results and the efficiency of your data processing. Let’s explore why it’s a good idea to handle duplicate keys prior to merging data:

If there are duplicate keys in both tables, merging them produces a Cartesian product for each duplicated key: every occurrence of the key in one table is matched with every occurrence of the same key in the other table. A key that appears m times in one table and n times in the other yields m × n rows, so the number of rows grows multiplicatively.
Duplicate keys might represent data errors or...