Finding outliers using the mean and standard deviation
In normally distributed variables, about 99.7% of the observations lie within three standard deviations of the mean, that is, within the interval from the mean minus three times the standard deviation to the mean plus three times the standard deviation. Values beyond those limits are rare and can therefore be considered outliers.
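As a quick sanity check of that coverage figure, the share of a normal distribution lying within k standard deviations of the mean can be computed from the error function. The helper name `coverage_within` below is ours, introduced only for illustration:

```python
import math


def coverage_within(k: float) -> float:
    """Probability that a normally distributed value lies within
    k standard deviations of its mean: P(|Z| < k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))


# Print the coverage for one, two, and three standard deviations.
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage_within(k):.2%}")
```

For k = 3, the result is roughly 99.73%, which means that only about 0.27% of the observations are expected beyond the limits when the variable is truly normal.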
Note
Using the mean and standard deviation to detect outliers has some drawbacks. First, it assumes that the variable, outliers included, follows a normal distribution. Second, the outliers themselves strongly influence the mean and standard deviation, so extreme values can widen the limits and hide other outliers. Therefore, a recommended alternative is the Median Absolute Deviation (MAD), which we’ll discuss in the next recipe.
In this recipe, we will identify outliers as those observations that lie outside the interval delimited by the mean plus and minus three times the standard deviation.
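Before working with the dataset, the rule can be sketched on a synthetic array. This is a minimal illustration and not the recipe's own code: the function name `find_outliers_mean_std`, the `n_std` parameter, and the simulated data are all assumptions made for the example.

```python
import numpy as np


def find_outliers_mean_std(values, n_std=3.0):
    """Flag values outside mean +/- n_std * standard deviation.

    Hypothetical helper: returns a boolean mask plus the two limits.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)  # sample standard deviation
    lower = mean - n_std * std
    upper = mean + n_std * std
    mask = (values < lower) | (values > upper)
    return mask, lower, upper


# Simulated data (an assumption): 1,000 normal values plus two extreme ones.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(50, 5, size=1000), [120.0, -30.0]])

mask, lower, upper = find_outliers_mean_std(sample)
print(f"limits: [{lower:.2f}, {upper:.2f}]")
print(f"number of flagged observations: {mask.sum()}")
```

Note that the two injected extreme values are themselves part of the mean and standard deviation calculation, so they widen the limits slightly. This is the second drawback mentioned in the note above.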
How to do it...
Let’s begin the recipe by importing the Python libraries and loading the dataset:
- Import the Python libraries and load the dataset:
import numpy as...