Chapter 7: Fixing Messy Data when Aggregating
Earlier chapters of this book introduced techniques for generating summary statistics on a whole DataFrame. We used methods such as describe, mean, and quantile to do that. This chapter covers more complicated aggregation tasks: aggregating by categorical variables, and using aggregation to change the structure of DataFrames.
After the initial stages of data cleaning, analysts spend a substantial amount of their time doing what Hadley Wickham has called splitting-applying-combining. That is, we subset data by groups, apply some operation to those subsets, and then draw conclusions about a dataset as a whole. In slightly more specific terms, this involves generating descriptive statistics by key categorical variables. For the nls97 dataset, this might be gender, marital status, and highest degree received. For the COVID-19 data, we might segment the data by country or date.
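To make the splitting-applying-combining idea concrete, here is a minimal sketch using pandas groupby. The toy DataFrame and its column names (gender, weeksworked) are assumptions made up for illustration. They are not the actual nls97 columns, and this is only one of several ways to express the pattern.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative and are not
# taken from the nls97 or COVID-19 datasets.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female"],
    "weeksworked": [48, 52, 30, 40, 50],
})

# Split the rows by gender, apply summary functions to each subset,
# and combine the results into one DataFrame indexed by gender.
summary = df.groupby("gender")["weeksworked"].agg(["count", "mean", "median"])
print(summary)
```

Each row of summary describes one group, so the result has as many rows as there are distinct values of the grouping variable rather than one row per original observation.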
Often, we need to aggregate data to prepare it for subsequent...