Removing outliers
Recent studies distinguish three types of outliers: error outliers, interesting outliers, and random outliers. Error outliers are likely due to human or methodological errors and should be either corrected or removed from the data analysis. Interesting and random outliers, by contrast, are genuine observations that you generally don't want to remove. In this recipe, we'll assume the outliers are errors and remove them from the dataset.
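To make the idea of removing outliers concrete before the steps below, here is a minimal pandas-only sketch of the IQR proximity rule that this recipe relies on. It is an illustration under stated assumptions, not the recipe's own code: the helper name `iqr_limits`, the toy `value` column, and the multiplier of 1.5 (a common convention) are all chosen here for demonstration.

```python
import pandas as pd


def iqr_limits(series, fold=1.5):
    """Return (lower, upper) limits from the IQR proximity rule.

    `fold` is the multiplier applied to the IQR; 1.5 is a common
    convention and is an assumption of this sketch.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Toy data, made up for illustration: six typical values and one extreme value.
data = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 95]})

lower, upper = iqr_limits(data["value"])

# Keep only the rows whose value lies within the limits (inclusive).
trimmed = data[data["value"].between(lower, upper)]

print(f"limits: [{lower}, {upper}]")
print(f"rows before: {len(data)}, rows after: {len(trimmed)}")
```

With this toy data, the extreme value falls above the upper limit and its row is dropped, while the six typical values are kept. On a real dataset, the limits would normally be learned from the training set only and then applied to both the train and test sets.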
How to do it...
We’ll use the IQR proximity rule to find the outliers and then remove them from the data using pandas and Feature-engine:
- Let’s import the Python libraries, functions, and classes:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from feature_engine.outliers import OutlierTrimmer
- Load the California housing dataset from scikit-learn and separate it into train and test sets:
X, y = fetch_california_housing( ...