Performing Data Aggregation
We are getting close to the end of this chapter, but before we wrap it up, there is one more technique to explore for creating new features: data aggregation. The idea is to summarize a numerical column for each group defined by another column. We already saw an example of this in Chapter 5, Performing Your First Cluster Analysis, where we used the .pivot_table() method to aggregate two numerical variables from the ATO dataset (Average net tax and Average total deductions) for each cluster found by k-means. At that time, however, we aggregated the data to understand the differences between these clusters, not to create new features.
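To make the idea concrete, here is a minimal sketch on a small made-up DataFrame. The column names cluster and net_tax, and the values, are hypothetical and are not taken from the ATO dataset. It first summarizes the numerical column per group with .pivot_table(), then attaches the same group-level summary to every row as a new feature using .groupby() with .transform(), which is one common way to do this in pandas.

```python
import pandas as pd

# Hypothetical toy data: these names and values are illustrative only
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "net_tax": [10.0, 20.0, 30.0, 60.0, 90.0],
})

# Summarize the numerical column for each group (one row per cluster)
summary = df.pivot_table(values="net_tax", index="cluster", aggfunc="mean")

# Add the group-level mean back to each record as a new feature;
# transform() returns a result aligned with the original rows
df["net_tax_cluster_mean"] = df.groupby("cluster")["net_tax"].transform("mean")

print(summary)
print(df)
```

The summary table has one row per cluster, whereas the new column repeats each cluster's mean on every record belonging to that cluster.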
You may wonder in which cases you would want to perform feature engineering using data aggregation. If you already have a numerical column that contains a value for each record, why would you need to summarize it and add this information back to the DataFrame? It feels like we are just adding the same information...