Identifying outliers
Outliers are unusually high or low values in a dataset. Compared with the other observations, they stand out as different and are considered extreme values. Outliers arise for several reasons: genuine extreme values, measurement errors, data entry errors, and data processing errors. Measurement errors are typically caused by faulty instruments, such as weighing scales or sensors. Data entry errors occur when users provide inaccurate inputs, for example by mistyping a value, using the wrong data format, or swapping values (transposition errors). Processing errors can occur when data is aggregated or transformed to produce a final output.
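As a minimal sketch of how such a value stands out from the rest, the snippet below flags values using the common 1.5 × IQR rule. The weights are hypothetical (510 stands in for a transposition error in which 150 was typed as 510), and both the rule and the helper name `flag_outliers` are assumptions chosen for illustration rather than a method prescribed by the text.

```python
import statistics

# Hypothetical body weights in kg; 510 mimics a transposition error (150 typed as 510).
weights = [62, 70, 58, 75, 68, 510, 66, 73]


def flag_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


outliers = flag_outliers(weights)
print(outliers)  # [510]
```

A rule like this only marks candidates. Whether a flagged value is an error or a genuine extreme value still has to be judged from the context of the data.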
It is important to spot and handle outliers because they can distort an analysis and lead to wrong conclusions. The following example shows this:
PersonID |
Industry... |