Python Data Cleaning Cookbook: Prepare your data for analysis with pandas, NumPy, Matplotlib, scikit-learn, and OpenAI

Product type Paperback

Published in May 2024

Publisher Packt

ISBN-13 9781803239873

Length 486 pages

Edition 2nd Edition

Languages

Python

Tools

Matplotlib

Concepts

Data Analysis

Author (1):

Michael Walker


Table of Contents

Preface

1. Anticipating Data Cleaning Issues When Importing Tabular Data with pandas

2. Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data

3. Taking the Measure of Your Data

4. Identifying Outliers in Subsets of Data

5. Using Visualizations for the Identification of Unexpected Values

6. Cleaning and Exploring Data with Series Operations

7. Identifying and Fixing Missing Values

8. Encoding, Transforming, and Scaling Features

9. Fixing Messy Data When Aggregating

10. Addressing Data Issues When Combining DataFrames

11. Tidying and Reshaping Data

12. Automate Data Cleaning with User-Defined Functions, Classes, and Pipelines

13. Index

What this book covers

Chapter 1, Anticipating Data Cleaning Issues When Importing Tabular Data with pandas, explores tools for loading CSV files, Excel files, relational database tables, SAS, SPSS, Stata, and R files into pandas DataFrames.

Chapter 2, Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data, discusses techniques for reading and normalizing JSON data, web scraping, and working with big data using Spark. It also explores techniques for persisting data, including with versioning.

Chapter 3, Taking the Measure of Your Data, introduces common techniques for navigating around a DataFrame, selecting columns and rows, and generating summary statistics. The use of OpenAI tools for examining dataset structure and generating statistics is introduced.

Chapter 4, Identifying Outliers in Subsets of Data, explores a wide range of strategies to identify outliers across a whole DataFrame and by selected groups.

Chapter 5, Using Visualizations for the Identification of Unexpected Values, demonstrates the use of the Matplotlib and Seaborn tools to visualize how key variables are distributed, including with histograms, boxplots, scatter plots, line plots, and violin plots.

Chapter 6, Cleaning and Exploring Data with Series Operations, discusses updating pandas Series with scalars, arithmetic operations, and conditional statements based on the values of one or more Series.
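As a rough illustration of the Series operations this chapter description mentions, here is a minimal sketch. The DataFrame, its column names, and the thresholds are hypothetical and are not taken from the book's datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only (not from the book).
df = pd.DataFrame({
    "hours": [40, 0, 25, 60],
    "wage": [20.0, 15.0, np.nan, 30.0],
})

# Update a Series with a scalar.
df["bonus"] = 100

# Arithmetic between two Series (missing wages propagate as NaN).
df["weeklypay"] = df["hours"] * df["wage"]

# Conditional assignment based on the values of one Series.
df["fulltime"] = np.where(df["hours"] >= 35, "yes", "no")

# Conditional assignment based on several Series; the first
# matching condition wins, and unmatched rows get the default.
conditions = [
    df["wage"].isna(),
    df["hours"] == 0,
    df["weeklypay"] >= 1000,
]
labels = ["unknown", "not working", "high"]
df["paygroup"] = np.select(conditions, labels, default="other")

print(df)
```

The sketch uses np.where for a two-way condition and np.select for several ordered conditions. Other approaches, such as Series.where or DataFrame.apply, may suit other cases better.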

Chapter 7, Identifying and Fixing Missing Values, goes over strategies for identifying missing values across rows and columns, and over subsets of data. It explores strategies for imputing values, such as forward filling or setting values to the overall mean or the mean for a given category. It also examines multivariate techniques for imputing missing values and discusses when they are appropriate.
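The simpler imputation strategies named here can be sketched in a few lines of pandas. This is only an illustration under assumed data: the group and score columns are hypothetical, and which strategy is appropriate depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only (not from the book).
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "score": [10.0, np.nan, 20.0, np.nan, 40.0],
})

# Identify missing values by column and by row.
missing_by_col = df.isna().sum()
missing_by_row = df.isna().sum(axis=1)

# Impute with the overall mean of the non-missing values.
overall = df["score"].fillna(df["score"].mean())

# Impute with the mean of each row's own category.
group_means = df.groupby("group")["score"].transform("mean")
bygroup = df["score"].fillna(group_means)

# Forward fill: carry the last observed value forward.
ffilled = df["score"].ffill()

print(missing_by_col)
print(pd.DataFrame({"overall": overall, "bygroup": bygroup, "ffill": ffilled}))
```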

Chapter 8, Encoding, Transforming, and Scaling Features, covers a range of variable transformation techniques to prepare features and targets for predictive analysis. These include the most common kinds of encoding (one-hot, ordinal, and hashing); transformations to improve the distribution of variables; and binning and scaling approaches to address skewness, kurtosis, and outliers and to adjust for features with widely different ranges.

Chapter 9, Fixing Messy Data When Aggregating, demonstrates multiple approaches to aggregating data by group, including looping through data with itertuples or NumPy arrays, dropping duplicate rows, and using pandas’ groupby and pivot tables. It also discusses when to choose one approach over the others.

Chapter 10, Addressing Data Issues When Combining DataFrames, examines different strategies for concatenating and merging data, and how to anticipate common data challenges when combining data.
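One way to anticipate problems when combining data is to ask pandas to report which rows matched and to enforce the expected key relationship. The sketch below assumes two small hypothetical DataFrames; it is not the book's own example.

```python
import pandas as pd

# Hypothetical data for illustration only (not from the book).
left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "y": [20, 30, 40]})

# Concatenate vertically: rows are appended, columns are aligned by name.
stacked = pd.concat([left, left], ignore_index=True)

# Merge horizontally. indicator=True adds a _merge column showing where
# each row came from; validate raises an error if the merge keys are
# not unique on both sides.
merged = pd.merge(
    left,
    right,
    on="id",
    how="outer",
    indicator=True,
    validate="one_to_one",
)

# Tally matched and unmatched rows before deciding how to handle them.
counts = merged["_merge"].value_counts()
print(merged)
print(counts)
```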

Chapter 11, Tidying and Reshaping Data, introduces several strategies for de-duplicating, stacking, melting, and pivoting data.
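The reshaping operations listed for this chapter could look roughly like the following sketch. The wide-format DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical wide-format data with one duplicated row (not from the book).
wide = pd.DataFrame({
    "id": [1, 2, 2],
    "y2020": [5, 7, 7],
    "y2021": [6, 8, 8],
})

# De-duplicate: drop rows that repeat an earlier row exactly.
deduped = wide.drop_duplicates()

# Melt: one row per id and year instead of one column per year.
long = deduped.melt(id_vars="id", var_name="year", value_name="value")

# Pivot: return to one row per id with a column for each year.
back = long.pivot(index="id", columns="year", values="value")

# Stack: move the column labels into an inner index level.
stacked = deduped.set_index("id").stack()

print(long)
print(back)
```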

Chapter 12, Automate Data Cleaning with User-Defined Functions, Classes, and Pipelines, examines how to turn many of the techniques from the first 11 chapters into reusable code.

Download the example code files

The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781803239873.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”

A block of code is set as follows:

import pandas as pd
import os
import sys
nls97 = pd.read_csv("data/nls97g.csv", low_memory=False)
nls97.set_index('personid', inplace=True)

Any output from the code will appear like this:

       satverbal  satmath
min           14        7
per15        390      390
qr1          430      430
med          500      500
qr3          570      580
per85        620      621
max          800      800
count      1,406    1,407
mean         500      501
iqr          140      150

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: “Select System info from the Administration panel.”

Warnings or important notes appear like this.

Tips and tricks appear like this.