Extreme values and outliers
An outlier or extreme value is a data point that deviates so far from the other observations that it raises the suspicion of having been generated by a totally different mechanism, or simply by error. Identifying outliers is important because those extreme values can:
Increase error variance
Influence estimates
Decrease normality
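The three effects listed above can be illustrated with a minimal sketch. It uses only base R functions, not the outliers package, and the values are made up for the demonstration (the names x and x_out are hypothetical):

```r
# Hypothetical sample: ten ordinary measurements (made-up values)
x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9)

# The same sample with a single extreme value appended
x_out <- c(x, 25)

# Error variance: one extreme value inflates the sample variance
var(x)
var(x_out)

# Estimates: the mean is pulled towards the extreme value,
# while the median stays the same in this example
mean(x)
mean(x_out)
median(x)
median(x_out)

# Normality: the Shapiro-Wilk test gives a much smaller p-value
# once the extreme value is included
p_clean <- shapiro.test(x)$p.value
p_out <- shapiro.test(x_out)$p.value
p_clean
p_out
```

Comparing each pair of results shows how strongly a single observation can change the summary of an otherwise well-behaved sample.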
To use an analogy, think of your raw dataset as a rounded piece of stone that is to be used as a perfect ball in some game, and that has to be cleaned and polished before actual use. The stone has some small holes on its surface, like the missing values in the data, which should be filled, in our case with data imputation.
On the other hand, the stone not only has holes on its surface: mud also covers some parts of it, and that mud has to be removed. But how can we distinguish mud from the real stone? In this section, we will focus on what the outliers package and some related methods have to offer for identifying extreme values.
As this package...