Finding outliers using the mean and standard deviation
In a normally distributed variable, about 99.7% of the observations lie within three standard deviations of the mean. Values beyond those limits can therefore be considered outliers. In this recipe, we will identify outliers as the observations that lie outside this interval.
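As a quick check of that figure, the share of a normal distribution within a given number of standard deviations of the mean can be computed from the error function. The sketch below uses only the standard library; the helper name normal_coverage is an assumption made for illustration and is not part of the recipe.

```python
import math


def normal_coverage(fold):
    """Fraction of a normal distribution within mean +/- fold standard deviations.

    Hypothetical helper for illustration; not part of the recipe.
    """
    return math.erf(fold / math.sqrt(2))


# Coverage for one, two, and three standard deviations.
for fold in (1, 2, 3):
    print(fold, round(normal_coverage(fold), 4))
```

The loop prints roughly 0.6827, 0.9545, and 0.9973, so under normality only about 0.3% of the observations are expected to fall outside the three-standard-deviation interval.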
How to do it...
Let’s begin the recipe by importing the Python libraries and loading the dataset:
- Import the required Python libraries:
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
- Let’s load the Breast Cancer dataset from scikit-learn:
    breast_cancer = load_breast_cancer()
    X = pd.DataFrame(
        breast_cancer.data,
        columns=breast_cancer.feature_names,
    )
- Let’s create a function that returns the mean plus and minus fold times the standard deviation, where fold is a parameter to the function...
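One way such a function might look is sketched below. The name find_limits, its signature, and the toy data are assumptions made for illustration, not the recipe's own code. The sketch relies on the pandas default, which is the sample standard deviation (ddof=1).

```python
import pandas as pd


# Hypothetical helper: the name and signature are assumptions, not taken from the recipe.
def find_limits(df, variable, fold):
    """Return (lower, upper) = mean -/+ fold * standard deviation of df[variable]."""
    mean = df[variable].mean()
    std = df[variable].std()  # pandas default: sample standard deviation (ddof=1)
    return mean - fold * std, mean + fold * std


# Toy data, made up for illustration: 19 typical values and one extreme value.
toy = pd.DataFrame({"value": [10.0] * 19 + [100.0]})
lower, upper = find_limits(toy, "value", fold=3)

# Flag the observations that lie outside the interval.
outliers = toy[(toy["value"] < lower) | (toy["value"] > upper)]
print(lower, upper)
print(outliers)
```

On the toy data, only the value 100.0 falls outside the interval. The same call should work on a column of X, for example find_limits(X, "mean radius", 3), since "mean radius" is one of the feature names in the Breast Cancer dataset.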