Identifying Outliers in Subsets of Data
Outliers and unexpected values are not necessarily errors; often they are not. Individuals and events are complicated and can surprise the analyst. Some people really are 7’4” tall and some really have $50 million salaries. Sometimes data is messy because people and situations are messy. Even so, extreme values can have an outsized impact on our analysis, particularly when we are using parametric techniques that assume a normal distribution.
These issues may become even more apparent when working with subsets of data. That is not just because extreme or unexpected values carry more weight in smaller samples. It is also because they may make less sense once bivariate and multivariate relationships are considered. When the 7’4” person, or the person making $50 million, is 10 years old, the red flag gets even redder. A combination like that may point to a measurement or data collection error.
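To make the subset point concrete, here is a minimal sketch that applies the common 1.5 × IQR rule first to pooled data and then within each subset. The heights (in inches), the group labels, and the helper names iqr_bounds and flag_outliers are all invented for illustration; they are not from any real dataset, and the IQR rule is only one of several reasonable ways to flag extreme values.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical heights in inches, keyed by an assumed age group.
heights = {
    "adult": [64, 66, 67, 68, 69, 70, 71, 72, 74, 88],
    "age_10": [52, 53, 54, 55, 55, 56, 57, 58, 59, 70],
}

# Pooled: the wide spread across both groups hides the extreme values.
pooled = [h for group in heights.values() for h in group]
pooled_flags = flag_outliers(pooled)

# By subset: each value is judged against its own group.
subset_flags = {name: flag_outliers(vals) for name, vals in heights.items()}

print("pooled:", pooled_flags)
print("by subset:", subset_flags)
```

With these made-up numbers, the pooled check flags nothing, while the per-group check flags the 88-inch (7’4”) adult and the 70-inch 10-year-old. A height of 70 inches is ordinary among adults but extreme among 10-year-olds, which is the kind of value that only stands out once the relationship with age is taken into account.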
But the key issue is the undue influence that...