Python Data Cleaning Cookbook

Prepare your data for analysis with pandas, NumPy, Matplotlib, scikit-learn, and OpenAI

Product type: Paperback
Published: May 2024
Publisher: Packt
ISBN-13: 9781803239873
Length: 486 pages
Edition: 2nd Edition
Author: Michael Walker
Table of Contents

Preface
1. Anticipating Data Cleaning Issues When Importing Tabular Data with pandas
2. Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data
3. Taking the Measure of Your Data
4. Identifying Outliers in Subsets of Data
5. Using Visualizations for the Identification of Unexpected Values
6. Cleaning and Exploring Data with Series Operations
7. Identifying and Fixing Missing Values
8. Encoding, Transforming, and Scaling Features
9. Fixing Messy Data When Aggregating
10. Addressing Data Issues When Combining DataFrames
11. Tidying and Reshaping Data
12. Automate Data Cleaning with User-Defined Functions, Classes, and Pipelines
13. Index

What this book covers

Chapter 1, Anticipating Data Cleaning Issues When Importing Tabular Data with pandas, explores tools for loading CSV files, Excel files, relational database tables, SAS, SPSS, Stata, and R files into pandas DataFrames.

Chapter 2, Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data, discusses techniques for reading and normalizing JSON data, web scraping, and working with big data using Spark. It also explores techniques for persisting data, including with versioning.
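
As a rough illustration of what normalizing JSON data can look like, the sketch below flattens nested records with pandas' json_normalize. The records are invented for this example and are not taken from the book.

```python
import pandas as pd

# Hypothetical nested records, invented for illustration only.
records = [
    {"id": 1, "name": "Ana", "address": {"city": "Lima", "zip": "15001"}},
    {"id": 2, "name": "Ben", "address": {"city": "Oslo", "zip": "0150"}},
]

# json_normalize flattens nested dictionaries into dotted column names,
# for example "address.city".
flat = pd.json_normalize(records)
print(flat)
```

Each nested key becomes its own column, which makes the result easier to inspect and clean with the usual DataFrame tools.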

Chapter 3, Taking the Measure of Your Data, introduces common techniques for navigating around a DataFrame, selecting columns and rows, and generating summary statistics. The use of OpenAI tools for examining dataset structure and generating statistics is introduced.

Chapter 4, Identifying Outliers in Subsets of Data, explores a wide range of strategies to identify outliers across a whole DataFrame and by selected groups.
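
One way to picture the difference between whole-DataFrame and by-group outlier checks is the sketch below. It assumes a simple 1.5 x IQR rule, which is only one of many possible strategies; the data and the helper name iqr_outliers are hypothetical.

```python
import pandas as pd

# Invented data: region "a" has one value far from its neighbors.
df = pd.DataFrame({
    "region": ["a"] * 5 + ["b"] * 5,
    "value": [10, 11, 12, 13, 50, 100, 101, 102, 103, 104],
})

def iqr_outliers(s, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3 (hypothetical helper)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Check against the distribution of the whole column.
df["outlier_overall"] = iqr_outliers(df["value"])

# Check each value against the distribution of its own group.
df["outlier_in_group"] = df.groupby("region")["value"].transform(iqr_outliers)
print(df)
```

In this toy data the value 50 is unremarkable across the whole column but stands out within region "a", which is why a by-group check can surface values that an overall check misses.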

Chapter 5, Using Visualizations for the Identification of Unexpected Values, demonstrates the use of the Matplotlib and Seaborn tools to visualize how key variables are distributed, including with histograms, boxplots, scatter plots, line plots, and violin plots.

Chapter 6, Cleaning and Exploring Data with Series Operations, discusses updating pandas Series with scalars, arithmetic operations, and conditional statements based on the values of one or more Series.
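
The three kinds of update mentioned here might look roughly like the following. The column names and the 40-hour threshold are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Invented data for illustration.
df = pd.DataFrame({"hours": [10, 45, 30], "rate": [20.0, 20.0, 15.0]})

# Scalar assignment: every row receives the same value.
df["bonus"] = 100

# Arithmetic between two Series, aligned row by row.
df["pay"] = df["hours"] * df["rate"]

# Conditional assignment based on the values of another Series.
df["status"] = np.where(df["hours"] >= 40, "full-time", "part-time")
print(df)
```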

Chapter 7, Identifying and Fixing Missing Values, goes over strategies for identifying missing values across rows and columns, and over subsets of data. It explores strategies for imputing values, such as forward filling and setting values to the overall mean or to the mean for a given category. It also examines multivariate techniques for imputing missing values and discusses when they are appropriate.
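
A minimal sketch of the three simple imputation strategies named above, using invented data. Which one is appropriate depends on the data; this only shows the mechanics.

```python
import numpy as np
import pandas as pd

# Invented data with one missing score in each group.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "score": [1.0, np.nan, 3.0, 10.0, np.nan, 20.0],
})

# Overall mean: every gap gets the same value.
overall = df["score"].fillna(df["score"].mean())

# Group mean: each gap gets the mean of its own category.
by_group = df["score"].fillna(df.groupby("group")["score"].transform("mean"))

# Forward fill: each gap takes the last observed value above it.
ffilled = df["score"].ffill()

print(pd.DataFrame({"overall": overall, "by_group": by_group, "ffill": ffilled}))
```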

Chapter 8, Encoding, Transforming, and Scaling Features, covers a range of variable transformation techniques to prepare features and targets for predictive analysis. These include the most common kinds of encoding (one-hot, ordinal, and hashing); transformations to improve the distribution of variables; and binning and scaling approaches to address skewness, kurtosis, and outliers and to adjust for features with widely different ranges.
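
To make the terms concrete, here is a small sketch of one-hot encoding, ordinal encoding, and min-max scaling. It uses plain pandas to stay self-contained, which is an assumption of this example; the book lists scikit-learn among its tools, and the data is invented.

```python
import pandas as pd

# Invented data for illustration.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "income": [20.0, 80.0, 50.0, 20.0],
})

# One-hot encoding: one indicator column per category.
onehot = pd.get_dummies(df["size"], prefix="size")

# Ordinal encoding: integer codes that follow an explicit category order.
order = ["small", "medium", "large"]
df["size_ordinal"] = pd.Categorical(df["size"], categories=order, ordered=True).codes

# Min-max scaling: rescale a feature to the range 0 to 1.
income_range = df["income"].max() - df["income"].min()
df["income_scaled"] = (df["income"] - df["income"].min()) / income_range

print(onehot)
print(df)
```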

Chapter 9, Fixing Messy Data When Aggregating, demonstrates multiple approaches to aggregating data by group, including looping through data with itertuples or NumPy arrays, dropping duplicate rows, and using pandas’ groupby and pivot tables. It also discusses when to choose one approach over the others.
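
The contrast between looping and groupby can be sketched as follows, with invented data. Both approaches produce the same totals here; the loop makes each step visible, while groupby is shorter and usually faster on large data.

```python
import pandas as pd

# Invented data for illustration.
df = pd.DataFrame({
    "store": ["x", "y", "x", "y", "x"],
    "sales": [5, 7, 3, 1, 2],
})

# Approach 1: loop over rows with itertuples and accumulate totals by hand.
totals_loop = {}
for row in df.itertuples(index=False):
    totals_loop[row.store] = totals_loop.get(row.store, 0) + row.sales

# Approach 2: let groupby do the aggregation.
totals_groupby = df.groupby("store")["sales"].sum()

print(totals_loop)
print(totals_groupby)
```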

Chapter 10, Addressing Data Issues When Combining DataFrames, examines different strategies for concatenating and merging data, and how to anticipate common data challenges when combining data.
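
One common way to anticipate problems when merging is to check key uniqueness and to see which rows matched. The sketch below uses the validate and indicator parameters of DataFrame.merge on invented data; it is an illustration, not the book's own recipe.

```python
import pandas as pd

# Invented data: ids 1 and 4 each appear in only one of the two frames.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

# validate raises an error if the merge keys are not unique on both sides;
# indicator adds a "_merge" column recording where each row came from.
merged = left.merge(
    right, on="id", how="outer", indicator=True, validate="one_to_one"
)

# Count matched and unmatched rows.
counts = merged["_merge"].astype(str).value_counts().to_dict()
print(merged)
print(counts)
```

Unmatched rows show up as left_only or right_only, which is a quick signal that the key values differ between the two sources.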

Chapter 11, Tidying and Reshaping Data, introduces several strategies for de-duplicating, stacking, melting, and pivoting data.
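
As a small illustration of de-duplicating, melting, and pivoting, consider the sketch below. The column names are hypothetical and the data is invented.

```python
import pandas as pd

# Invented wide data with one exact duplicate row.
wide = pd.DataFrame({
    "id": [1, 2, 2],
    "y2020": [5, 7, 7],
    "y2021": [6, 8, 8],
})

# De-duplicate: drop rows that repeat an earlier row exactly.
deduped = wide.drop_duplicates()

# Melt: turn the year columns into rows, one row per id and year.
long = deduped.melt(id_vars="id", var_name="year", value_name="value")

# Pivot: go back from long to wide.
back = long.pivot(index="id", columns="year", values="value")

print(long)
print(back)
```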

Chapter 12, Automate Data Cleaning with User-Defined Functions, Classes, and Pipelines, examines how to turn many of the techniques from the first 11 chapters into reusable code.
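
A minimal sketch of what reusable cleaning code might look like: two small functions chained with DataFrame.pipe. The function names and data are hypothetical and are not taken from the book.

```python
import pandas as pd

def strip_column_names(df):
    """Return a copy with column names trimmed and lowercased (hypothetical helper)."""
    return df.rename(columns=lambda c: c.strip().lower())

def drop_fully_empty_rows(df):
    """Return a copy without rows in which every value is missing (hypothetical helper)."""
    return df.dropna(how="all")

# Invented data: messy column names and one entirely empty row.
raw = pd.DataFrame({" Name ": ["ana", None], "AGE": [30, None]})

# Chain the steps so the same sequence can be reused on other DataFrames.
clean = raw.pipe(strip_column_names).pipe(drop_fully_empty_rows)
print(clean)
```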

Download the example code files

The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781803239873.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”

A block of code is set as follows:

import pandas as pd
import os
import sys
nls97 = pd.read_csv("data/nls97g.csv", low_memory=False)
nls97.set_index('personid', inplace=True)

Any output from the code will appear like this:

       satverbal  satmath
min           14        7
per15        390      390
qr1          430      430
med          500      500
qr3          570      580
per85        620      621
max          800      800
count      1,406    1,407
mean         500      501
iqr          140      150

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: “Select System info from the Administration panel.”

Warnings or important notes appear like this.

Tips and tricks appear like this.
