Grouping and summarizing data
Grouping and summarizing are complementary operations. They are generally used together, because grouping a dataset accomplishes little unless something is calculated for each group or the groups serve some other purpose. Summarizing then plays the important role of reducing the data in each group to a number or a short summary that we can understand.
In the business world, requests such as the average number of sales by store, the median number of customers by day, or the standard deviation of a distribution are part of the routine of a data scientist. These tasks can be performed using the group_by() and summarise() functions from dplyr.
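As a minimal sketch of such a request, the snippet below computes the mean, median, and standard deviation of units sold per store. The sales tibble and its store and units columns are hypothetical, invented only for this illustration, and are not part of the dataset used in this chapter.

```r
library(dplyr)

# Hypothetical sales data, made up for illustration only
sales <- tibble(
  store = c("A", "A", "B", "B", "B"),
  units = c(10, 20, 5, 15, 40)
)

# Group by store, then collapse each group to one row of statistics
sales_summary <- sales %>%
  group_by(store) %>%
  summarise(
    mean_units = mean(units),
    median_units = median(units),
    sd_units = sd(units)
  )

sales_summary
```

The result has one row per store: group_by() defines the groups, and summarise() reduces each of them to a single row.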
Starting with the group_by() function, observe that it alone cannot bring much value:
# group by not summarized
df %>% group_by(workclass)
Here is the result.
Figure 8.9 – Dataset grouped but not summarized
We can see in Figure 8.9 that it worked because...