Using linear regression to identify data points with significant influence
The remaining recipes in this chapter use statistical modeling to identify outliers. These techniques depend less on the distribution of the variable of concern, and they take more into account than univariate or bivariate analyses can reveal. This allows us to identify outliers that are not otherwise apparent. On the other hand, by taking more factors into account, multivariate techniques may show that a previously suspect value is actually within an expected range and provides meaningful information.
In this recipe, we use linear regression to identify observations (rows) that have an outsized influence on models of a target or dependent variable. Such influence can indicate that one or more values for a few observations are so extreme that they compromise the model fit for all of the other observations.
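To make the idea of influence concrete, here is a minimal sketch that scores each row with Cook's distance, one common measure of how much a single observation changes a fitted regression. The choice of Cook's distance, the helper name `cooks_distance`, the synthetic data, and the 4/n cutoff (a widely used rule of thumb, not a strict test) are all assumptions made for illustration; they are not taken from this recipe's own code.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each row of an ordinary least squares fit.

    Hypothetical helper for illustration; X holds the predictors
    (without an intercept column) and y holds the target.
    """
    X = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficients
    resid = y - X @ beta
    # Leverage: the diagonal of the hat matrix X (X'X)^-1 X'
    xtx_inv = np.linalg.pinv(X.T @ X)
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid**2 / (p * s2) * leverage / (1 - leverage) ** 2

# Synthetic data (an assumption): a clean linear relationship plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

# Plant one extreme observation that contradicts the overall trend
x[0], y[0] = 6.0, -10.0

d = cooks_distance(x.reshape(-1, 1), y)

# Rule-of-thumb cutoff (an assumption): flag rows with distance above 4/n
flagged = np.flatnonzero(d > 4 / len(y))
print(flagged)
```

The planted row combines an extreme predictor value with a target far from the trend, which is the pattern that produces high influence. Flagged rows are candidates for closer inspection, not automatic removal.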
Getting ready
The code in this recipe requires...