Chapter 4: Identifying Missing Values and Outliers in Subsets of Data
Outliers and unexpected values are not necessarily errors; often they are not. Individuals and events are complicated and can surprise the analyst. Some people really are 7'4" tall, and some really do earn $50 million salaries. Sometimes data is messy because people and situations are messy. Even so, extreme values can have an outsized impact on our analysis, particularly when we use parametric techniques that assume a normal distribution.
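To make the point about outsized impact concrete, here is a minimal sketch using only Python's standard library. The salary figures are invented for illustration and do not come from any dataset used in this chapter; the helper name summarize is likewise hypothetical. The sketch compares the mean, median, and standard deviation of a small sample before and after a single $50 million salary is added.

```python
import statistics

# Hypothetical salaries in dollars; these values are invented for illustration.
salaries = [42_000, 48_000, 51_000, 55_000, 60_000, 63_000, 70_000]

# The same sample with one legitimate but extreme value appended.
with_outlier = salaries + [50_000_000]


def summarize(values):
    """Return the mean, median, and sample standard deviation of values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


base = summarize(salaries)
extreme = summarize(with_outlier)

# Print each statistic before and after the extreme value is included.
for name in ("mean", "median", "stdev"):
    print(f"{name:>6}: {base[name]:>14,.0f} -> {extreme[name]:>14,.0f}")
```

In this made-up sample the median shifts only from 55,000 to 57,500, while the mean grows more than a hundredfold. Statistics built on the mean and variance, which parametric techniques rely on, are the ones a single extreme value distorts most.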
These issues may become even more apparent when working with subsets of data. That is not just because extreme or unexpected values have more weight in smaller samples. It is also because they may make less sense when bivariate and multivariate relationships are considered. When the 7'4" person, or the person making $50 million, is 10 years old, the red flag gets even redder. We take these complications into account in this chapter when considering strategies for detecting...