You're reading from Python Feature Engineering Cookbook Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type Paperback

Published in Jan 2020

Publisher Packt

ISBN-13 9781789806311

Length 372 pages

Edition 1st Edition

Languages

Python

Tools

NumPy

Concepts

Machine Learning

Author (1):

Soledad Galli


Highlighting outliers

An outlier is a data point that is significantly different from the remaining data. On occasion, outliers are very informative; for example, when monitoring credit card transactions, an outlier may be an indication of fraud. In other cases, outliers are rare observations that add no useful information and may affect the performance of some machine learning models.

"An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism." [D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.]

Getting ready

In this recipe, we will learn how to identify outliers using boxplots and the inter-quartile range (IQR) proximity rule. According to the IQR proximity rule, a value is an outlier if it falls outside these boundaries:

Upper boundary = 75th quantile + (IQR * 1.5)

Lower boundary = 25th quantile - (IQR * 1.5)

Here, IQR is given by the following equation:

IQR = 75th quantile - 25th quantile

Typically, we calculate the IQR proximity rule boundaries by multiplying the IQR by 1.5. However, it is also common practice to find extreme values by multiplying the IQR by 3.
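
As a quick worked example of the rule, the following sketch computes the boundaries for a small made-up sample, using both 1.5 and 3 as the multiplier. The values and the names values, factor, and results are illustrative assumptions and are not taken from the Boston dataset:

```python
import numpy as np

# Hypothetical sample, chosen only to illustrate the arithmetic
values = np.array([2.0, 4.0, 4.0, 5.0, 6.0, 7.0, 8.0, 16.0, 30.0])

# 25th and 75th quantiles and the IQR
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25

# Boundaries and flagged values for each multiplier
results = {}
for factor in (1.5, 3):
    lower = q25 - factor * iqr
    upper = q75 + factor * iqr
    flagged = values[(values < lower) | (values > upper)]
    results[factor] = (lower, upper, flagged)
    print(factor, lower, upper, flagged)
```

With this sample, the 1.5 multiplier flags both 16 and 30, whereas the multiplier of 3 flags only 30, the more extreme value.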

How to do it...

Let's begin by importing the necessary libraries and preparing the dataset:

1. Import the required Python libraries and the dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston

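2. Load the Boston House Prices dataset from scikit-learn and retain three of its variables in a dataframe (note that load_boston() was deprecated in scikit-learn 1.0 and removed in version 1.2, so this step requires an earlier release):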

boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)[['RM', 'LSTAT', 'CRIM']]

3. Make a boxplot for the RM variable:

sns.boxplot(y=boston['RM'])
plt.title('Boxplot')

The output of the preceding code is as follows:

We can change the size of the plot using the figure() function from Matplotlib. We need to call it before making the plot with seaborn:

plt.figure(figsize=(3,6))
sns.boxplot(y=boston['RM'])
plt.title('Boxplot')

To find the outliers in a variable, we need to find the distribution boundaries according to the IQR proximity rule, which we discussed in the Getting ready section of this recipe.

4. Create a function that takes a dataframe, a variable name, and the factor to use in the IQR calculation and returns the IQR proximity rule boundaries:

def find_boundaries(df, variable, distance):
    # 25th and 75th quantiles of the variable
    q25 = df[variable].quantile(0.25)
    q75 = df[variable].quantile(0.75)
    IQR = q75 - q25

    # the boundaries lie `distance` times the IQR beyond the quartiles
    lower_boundary = q25 - (IQR * distance)
    upper_boundary = q75 + (IQR * distance)

    return upper_boundary, lower_boundary

5. Calculate and then display the IQR proximity rule boundaries for the RM variable:

upper_boundary, lower_boundary = find_boundaries(boston, 'RM', 1.5)
upper_boundary, lower_boundary

The find_boundaries() function returns the values above and below which we can consider a value to be an outlier, as shown here:

(7.730499999999999, 4.778500000000001)

If you want to find very extreme values, you can use 3 as the distance of find_boundaries() instead of 1.5.

Now, we need to find the outliers in the dataframe.

6. Create a boolean vector to flag observations outside the boundaries we determined in step 5:

outliers = np.where(boston['RM'] > upper_boundary, True,
            np.where(boston['RM'] < lower_boundary, True, False))
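
The nested np.where() call can also be written as a single boolean expression. The sketch below uses a small hypothetical Series and made-up boundaries (not the values computed above) to check that the forms agree. The names rm, nested, combined, and outside are assumptions introduced for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical values and boundaries, for illustration only
rm = pd.Series([3.5, 5.9, 6.2, 6.6, 8.4], name='RM')
lower_boundary, upper_boundary = 4.8, 7.7

# Nested np.where(), as in the recipe
nested = np.where(rm > upper_boundary, True,
                  np.where(rm < lower_boundary, True, False))

# Equivalent: combine both comparisons with the element-wise OR operator
combined = ((rm > upper_boundary) | (rm < lower_boundary)).to_numpy()

# Equivalent: negate an inclusive range check with Series.between()
outside = (~rm.between(lower_boundary, upper_boundary)).to_numpy()

print(nested, combined, outside)
```

All three produce the same boolean array, so the choice between them is mostly a matter of readability.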

7. Create a new pandas Series with the outlier values and then display its first five rows:

outliers_df = boston.loc[outliers, 'RM']
outliers_df.head()

We can see the first five outliers in the RM variable in the following output:

97     8.069
98     7.820
162    7.802
163    8.375
166    7.929
Name: RM, dtype: float64

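To select the values of RM that are not outliers, execute boston.loc[~outliers, 'RM']. To drop the corresponding rows from the entire dataframe, execute boston.loc[~outliers] instead.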
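
The difference between selecting one column and dropping whole rows can be sketched on a small hypothetical dataframe. The values, the outliers flags, and the names df, rm_without_outliers, and df_without_outliers are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data and outlier flags, for illustration only
df = pd.DataFrame({'RM': [3.5, 5.9, 6.2, 6.6, 8.4],
                   'LSTAT': [30.1, 12.4, 9.8, 7.2, 3.3]})
outliers = np.array([True, False, False, False, True])

# Series with the RM values that were not flagged
rm_without_outliers = df.loc[~outliers, 'RM']

# DataFrame with every column, minus the flagged rows
df_without_outliers = df.loc[~outliers]

print(rm_without_outliers.shape, df_without_outliers.shape)
```

Neither expression modifies df in place; each returns a new object, so the result has to be assigned to a variable to be kept.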

How it works...

In this recipe, we identified outliers in a numerical variable of the Boston House Prices dataset from scikit-learn using boxplots and the IQR proximity rule. We first loaded the dataset from scikit-learn and created a boxplot for one of its numerical variables, RM, as an example. Next, we created a function to identify the boundaries using the IQR proximity rule and used the function to determine the boundaries of the RM variable. Finally, we identified the values of RM that were higher or lower than those boundaries, that is, the outliers.

To load the data, we imported the dataset from sklearn.datasets and used load_boston(). Next, we captured the data in a dataframe using pandas DataFrame(), indicating that the data was stored in the data attribute and that the variable names were stored in the feature_names attribute. To retain only the RM, LSTAT, and CRIM variables, we passed the column names in double brackets [[]] immediately after the call to pandas DataFrame().

To display the boxplot, we used seaborn's boxplot() function and passed the pandas Series with the RM variable as an argument. In the boxplot displayed after step 3, the rectangle delimits the IQR, and the whiskers extend towards the upper and lower boundaries, that is, the 75th quantile plus 1.5 times the IQR and the 25th quantile minus 1.5 times the IQR. The outliers are the points plotted individually beyond the whiskers.
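
With Matplotlib's default whisker setting of 1.5, the whiskers end at the most extreme observations that still lie inside the boundaries, not at the boundaries themselves. The following sketch reproduces that logic with NumPy on a made-up sample; the values and variable names are assumptions for illustration, and no plot is drawn:

```python
import numpy as np

# Hypothetical sample, for illustration only
values = np.array([2.0, 4.0, 4.0, 5.0, 6.0, 7.0, 8.0, 16.0, 30.0])

q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25
lower_boundary = q25 - 1.5 * iqr
upper_boundary = q75 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the boundaries
inside = values[(values >= lower_boundary) & (values <= upper_boundary)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Observations beyond the boundaries are drawn as individual points
fliers = values[(values < lower_boundary) | (values > upper_boundary)]

print(lower_boundary, upper_boundary, lower_whisker, upper_whisker, fliers)
```

Here the upper boundary is 14, but the upper whisker would end at 8, the largest observation below that boundary.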

To identify those outliers in our dataframe, in step 4, we created a function to find the boundaries according to the IQR proximity rule. The function took the dataframe, the variable name, and the factor by which to multiply the IQR as arguments and calculated the IQR and the boundaries using the formula described in the Getting ready section of this recipe. With the pandas quantile() method, we calculated the values for the 25th (0.25) and 75th (0.75) quantiles. When we called it in step 5, the function returned the upper and lower boundaries for the RM variable.

To find the outliers of RM, we used NumPy's where() function, which produced a boolean vector with True wherever the value was an outlier. Briefly, the outer where() scanned the values of the RM variable and assigned True if the value was bigger than the upper boundary. Otherwise, the second where(), nested inside the first one, checked whether the value was smaller than the lower boundary, in which case it also assigned True; if neither condition held, it assigned False. Finally, we used the loc[] indexer from pandas to capture only those values of the RM variable that were outliers in a new pandas Series.
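
The same steps can be repeated for several variables in a loop. The following is a minimal sketch on a hypothetical dataframe: the column names a and b, their values, and the outlier_counts dictionary are assumptions, and find_boundaries() is restated so that the block runs on its own:

```python
import pandas as pd

def find_boundaries(df, variable, distance):
    # Same logic as the function from the recipe
    q25 = df[variable].quantile(0.25)
    q75 = df[variable].quantile(0.75)
    iqr = q75 - q25
    return q75 + (iqr * distance), q25 - (iqr * distance)

# Hypothetical data, for illustration only
data = pd.DataFrame({
    'a': [2, 4, 4, 5, 6, 7, 8, 16, 30],
    'b': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Count the observations outside the boundaries in each column
outlier_counts = {}
for column in data.columns:
    upper, lower = find_boundaries(data, column, 1.5)
    mask = (data[column] > upper) | (data[column] < lower)
    outlier_counts[column] = int(mask.sum())

print(outlier_counts)
```

In this made-up example, column a contains two values outside its boundaries and column b contains none.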