Removing duplicated rows
There are several reasons why we might have data duplicated at the unit of analysis:
- The existing DataFrame may be the result of a one-to-many merge, and the one side is the unit of analysis.
- The DataFrame contains repeated measures or panel data collapsed into a flat file, which is just a special case of the first situation.
- We may be working with an analysis file where multiple one-to-many relationships have been flattened, creating many-to-many relationships.
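As a minimal sketch of the first situation, the snippet below merges a student table with an enrollment table and then recovers one row per student. The DataFrames, column names, and values are hypothetical and made up purely for illustration; only the pandas calls (`merge`, `duplicated`, `drop_duplicates`) are real.

```python
import pandas as pd

# Hypothetical "one" side: one row per student (the unit of analysis).
students = pd.DataFrame({
    "student_id": [1, 2, 3],
    "major": ["Biology", "History", "Physics"],
})

# Hypothetical "many" side: one row per course enrollment.
enrollments = pd.DataFrame({
    "student_id": [1, 1, 2, 2, 2, 3],
    "course": ["BIO101", "CHM101", "HIS110", "HIS210", "ENG101", "PHY101"],
    "credits": [4, 4, 3, 3, 3, 4],
})

# A one-to-many merge repeats each student's columns once per enrollment.
# validate= raises an error if student_id is not unique in `students`.
merged = students.merge(
    enrollments, on="student_id", how="left", validate="one_to_many"
)

# Flag every row after the first for each student.
dup_mask = merged.duplicated(subset=["student_id"], keep="first")
n_duplicated = int(dup_mask.sum())

# Keep only the student-level columns, then drop the repeated rows.
students_again = (
    merged[["student_id", "major"]]
    .drop_duplicates(subset=["student_id"])
    .reset_index(drop=True)
)

print(f"{len(merged)} merged rows, {n_duplicated} flagged as duplicates")
print(students_again)
```

Dropping duplicates this way discards everything on the many side, so it is appropriate only when the enrollment-level columns are not needed in the analysis.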
When the one side is the unit of analysis, data on the many side may need to be collapsed in some way. For example, if we are analyzing outcomes for a cohort of students at a college, students are the unit of analysis, but we may also have course enrollment data for each student. To prepare the data for analysis, we might first need to count the number of courses, sum the total credits, or calculate the GPA for each student, so that we end up with one row per student. To generalize...