What this book covers
Chapter 1, Anticipating Data Cleaning Issues When Importing Tabular Data with pandas, explores tools for loading CSV files, Excel files, relational database tables, SAS, SPSS, Stata, and R files into pandas DataFrames.
Chapter 2, Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data, discusses techniques for reading and normalizing JSON data, web scraping, and working with big data using Spark. It also explores techniques for persisting data, including with versioning.
Chapter 3, Taking the Measure of Your Data, introduces common techniques for navigating around a DataFrame, selecting columns and rows, and generating summary statistics. It also introduces the use of OpenAI tools for examining dataset structure and generating statistics.
Chapter 4, Identifying Outliers in Subsets of Data, explores a wide range of strategies to identify outliers across a whole DataFrame and by selected groups.
Chapter 5, Using Visualizations for the Identification of Unexpected Values, demonstrates the use of the Matplotlib and Seaborn tools to visualize how key variables are distributed, including with histograms, boxplots, scatter plots, line plots, and violin plots.
Chapter 6, Cleaning and Exploring Data with Series Operations, discusses updating pandas Series with scalars, arithmetic operations, and conditional statements based on the values of one or more Series.
Chapter 7, Identifying and Fixing Missing Values, goes over strategies for identifying missing values across rows and columns, and over subsets of data. It explores strategies for imputing values, such as setting values to the overall mean or the mean for a given category, and forward filling. It also examines multivariate techniques for imputing missing values and discusses when they are appropriate.
Chapter 8, Encoding, Transforming, and Scaling Features, covers a range of variable transformation techniques to prepare features and targets for predictive analysis. This includes the most common kinds of encoding—one-hot, ordinal, and hashing encoding; transformations to improve the distribution of variables; and binning and scaling approaches to address skewness, kurtosis, and outliers and to adjust for features with widely different ranges.
Chapter 9, Fixing Messy Data When Aggregating, demonstrates multiple approaches to aggregating data by group, including looping through data with itertuples or NumPy arrays, dropping duplicate rows, and using pandas’ groupby and pivot tables. It also discusses when to choose one approach over the others.
Chapter 10, Addressing Data Issues When Combining DataFrames, examines different strategies for concatenating and merging data, and how to anticipate common data challenges when combining data.
Chapter 11, Tidying and Reshaping Data, introduces several strategies for de-duplicating, stacking, melting, and pivoting data.
Chapter 12, Automate Data Cleaning with User-Defined Functions and Classes and Pipelines, examines how to turn many of the techniques from the first 11 chapters into reusable code.
Download the example code files
The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781803239873.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”
A block of code is set as follows:
import pandas as pd
import os
import sys
nls97 = pd.read_csv("data/nls97g.csv", low_memory=False)
nls97.set_index('personid', inplace=True)
Any output from the code will appear like this:
       satverbal  satmath
min           14        7
per15        390      390
qr1          430      430
med          500      500
qr3          570      580
per85        620      621
max          800      800
count      1,406    1,407
mean         500      501
iqr          140      150
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: “Select System info from the Administration panel.”
Warnings or important notes appear like this.
Tips and tricks appear like this.