Uncovering Simpson's Paradox in the diamonds dataset with seaborn
It is unfortunately quite easy to report erroneous results when doing data analysis. Simpson's Paradox is one of the more common ways this can happen. It occurs when one group shows a higher result than another when all the data is aggregated, but the relationship reverses when the data is subdivided into segments. For instance, suppose two students, A and B, have each been given a test with 100 questions. Student A answers 50% of the questions correctly, while Student B gets 80% correct. This obviously suggests Student B has greater aptitude:
Student | Raw Score | Percent Correct |
A | 50/100 | 50 ... |
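To make the definition concrete, here is a minimal pandas sketch of how such an aggregate comparison could reverse once the data is segmented. Only the 50/100 and 80/100 totals come from the example above. The split into "hard" and "easy" questions, and every count within it, is a hypothetical assumption chosen purely for illustration.

```python
import pandas as pd

# Hypothetical breakdown: only the per-student totals (50/100 and 80/100)
# come from the text; the hard/easy split is assumed for illustration.
scores = pd.DataFrame(
    {
        "student": ["A", "A", "B", "B"],
        "difficulty": ["hard", "easy", "hard", "easy"],
        "correct": [45, 5, 2, 78],
        "attempted": [95, 5, 6, 94],
    }
)

# Aggregated view: sum over all questions for each student.
totals = scores.groupby("student")[["correct", "attempted"]].sum()
totals["pct_correct"] = 100 * totals["correct"] / totals["attempted"]

# Segmented view: percent correct per student within each difficulty level.
by_segment = scores.assign(
    pct_correct=100 * scores["correct"] / scores["attempted"]
).pivot(index="student", columns="difficulty", values="pct_correct")

print(totals)
print(by_segment.round(1))
```

With these assumed numbers, Student B has the higher overall percentage, yet Student A has the higher percentage within both the hard and the easy segment. The reversal arises because the two students attempted very different mixes of question difficulty.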