An outlier is a data point that is significantly different from the remaining data. On occasions, outliers are very informative; for example, when looking for credit card transactions, an outlier may be an indication of fraud. In other cases, outliers are rare observations that do not add any additional information. These cases may also affect the performance of some machine learning models.
Highlighting outliers
Getting ready
In this recipe, we will learn how to identify outliers using boxplots and the inter-quartile range (IQR) proximity rule. According to the IQR proximity rule, a value is an outlier if it falls outside these boundaries:
Upper boundary = 75th quantile + (IQR * 1.5)
Lower boundary = 25th quantile - (IQR * 1.5)
Here, IQR is given by the following equation:
IQR = 75th quantile - 25th quantile
How to do it...
Let's begin by importing the necessary libraries and preparing the dataset:
- Import the required Python libraries and the dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston
- Load the Boston House Prices dataset from scikit-learn and retain three of its variables in a dataframe:
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)[['RM', 'LSTAT', 'CRIM']]
- Make a boxplot for the RM variable:
sns.boxplot(y=boston['RM'])
plt.title('Boxplot')
The output of the preceding code is as follows:
plt.figure(figsize=(3,6))
sns.boxplot(y=boston['RM'])
plt.title('Boxplot')
To find the outliers in a variable, we need to find the distribution boundaries according to the IQR proximity rule, which we discussed in the Getting ready section of this recipe.
- Create a function that takes a dataframe, a variable name, and the factor to use in the IQR calculation and returns the IQR proximity rule boundaries:
def find_boundaries(df, variable, distance):
IQR = df[variable].quantile(0.75) - df[variable].quantile(0.25)
lower_boundary = df[variable].quantile(0.25) - (IQR * distance)
upper_boundary = df[variable].quantile(0.75) + (IQR * distance)
return upper_boundary, lower_boundary
- Calculate and then display the IQR proximity rule boundaries for the RM variable:
upper_boundary, lower_boundary = find_boundaries(boston, 'RM', 1.5)
upper_boundary, lower_boundary
The find_boundaries() function returns the values above and below which we can consider a value to be an outlier, as shown here:
(7.730499999999999, 4.778500000000001)
Now, we need to find the outliers in the dataframe.
- Create a boolean vector to flag observations outside the boundaries we determined in step 5:
outliers = np.where(boston['RM'] > upper_boundary, True,
np.where(boston['RM'] < lower_boundary, True, False))
- Create a new dataframe with the outlier values and then display the top five rows:
outliers_df = boston.loc[outliers, 'RM']
outliers_df.head()
We can see the top five outliers in the RM variable in the following output:
97 8.069 98 7.820 162 7.802 163 8.375 166 7.929 Name: RM, dtype: float64
To remove the outliers from the dataset, execute boston.loc[~outliers, 'RM'].
How it works...
In this recipe, we identified outliers in the numerical variables of the Boston House Prices dataset from scikit-learn using boxplots and the IQR proximity rule. To proceed with this recipe, we loaded the dataset from scikit-learn and created a boxplot for one of its numerical variables as an example. Next, we created a function to identify the boundaries using the IQR proximity rule and used the function to determine the boundaries of the numerical RM variable. Finally, we identified the values of RM that were higher or lower than those boundaries, that is, the outliers.
To load the data, we imported the dataset from sklearn.datasets and used load_boston(). Next, we captured the data in a dataframe using pandas DataFrame(), indicating that the data was stored in the data attribute and that the variable names were stored in the feature_names attribute. To retain only the RM, LSTAT, and CRIM variables, we passed the column names in double brackets [[]] at the back of pandas DataFrame().
To display the boxplot, we used seaborn's boxplot() method and passed the pandas Series with the RM variable as an argument. In the boxplot displayed after step 3, the IQR is delimited by the rectangle, and the upper and lower boundaries corresponding to either, the 75th quantile plus 1.5 times the IQR, or the 25th quantile minus 1.5 times the IQR. This is indicated by the whiskers. The outliers are the asterisks lying outside the whiskers.
To identify those outliers in our dataframe, in step 4, we created a function to find the boundaries according to the IQR proximity rule. The function took the dataframe and the variable as arguments and calculated the IQR and the boundaries using the formula described in the Getting ready section of this recipe. With the pandas quantile() method, we calculated the values for the 25th (0.25) and 75th quantiles (0.75). The function returned the upper and lower boundaries for the RM variable.
To find the outliers of RM, we used NumPy's where() method, which produced a boolean vector with True if the value was an outlier. Briefly, where() scanned the rows of the RM variable, and if the value was bigger than the upper boundary, it assigned True, whereas if the value was smaller, the second where() nested inside the first one and checked whether the value was smaller than the lower boundary, in which case it also assigned True, otherwise False. Finally, we used the loc[] method from pandas to capture only those values in the RM variable that were outliers in a new dataframe.