Identifying extreme values and outliers in univariate analysis

An outlier can be thought of as an observation with feature values, or relationships between feature values, that are so unusual that they cannot help explain relationships in the rest of the data. This matters for modeling because we cannot assume that the outliers will have a neutral impact on our parameter estimates. Sometimes, our models work so hard to construct parameter estimates that can account for patterns in outlier observations that we compromise the model's explanatory or predictive power for all other observations. Raise your hand if you have ever spent days trying to interpret a model only to discover that your coefficients and predictions completely changed once you removed a few outliers.

I should quickly add that there is no agreed-upon definition of an outlier. I offer the preceding definition for use in this book because it helps us distinguish between outliers, as I have described them, and extreme values. There is a fair bit of overlap between the two, but many extreme values are not outliers. This is because such values reflect a natural and explainable trend in a feature, or because they reflect the same relationship between features as is observed throughout the data. The reverse is also true. Some outliers are not extreme values. For example, a target value might be right in the middle of the distribution but have quite unexpected predictor values.

For our modeling, then, it is hard to say that a particular feature or target value is an outlier without referencing multivariate relationships. But it should at least raise a red flag when, in our univariate analysis, we see values well to the left or right of the center. This should prompt us to investigate the observation at that value further, including examining the values of other features. We will look at multivariate relationships in more detail in the next two chapters. Here, and in the next section on visualizations, we will focus on identifying extreme values and outliers when looking at one variable.

A good starting point for identifying an extreme value is to look at its distance from the middle of the distribution. One common method is to measure how far a value sits above the third quartile or below the first quartile, in units of the interquartile range (IQR), which is the distance between the first quartile value and the third quartile value. We often flag any value that is more than 1.5 times the interquartile range above the third quartile or below the first quartile. We can use this method to identify outliers in the COVID-19 data.
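
To see the rule in action before applying it to real data, here is a minimal sketch with a made-up Series (the numbers are hypothetical, chosen just to illustrate the fences):

    import pandas as pd
    ser = pd.Series([2, 3, 4, 5, 6, 7, 40])
    firstq, thirdq = ser.quantile(0.25), ser.quantile(0.75)
    iqr = thirdq - firstq
    # flag anything more than 1.5 * IQR beyond either quartile
    ser[(ser < firstq - 1.5*iqr) | (ser > thirdq + 1.5*iqr)]
    6    40
    dtype: int64

Only the value of 40 falls outside the fences, which run from -1 to 11 here.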

Let's get started:

  1. Let's start by importing the libraries we will need. In addition to pandas and NumPy, we will use Matplotlib and statsmodels for the plots we will create. We will also load the COVID data and select the variables we need:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    covidtotals = pd.read_csv("data/covidtotals.csv")
    covidtotals.set_index("iso_code", inplace=True)
    keyvars = ['location','total_cases_mill','total_deaths_mill',
      'aged_65_older','diabetes_prevalence','gdp_per_capita']
    covidkeys = covidtotals[keyvars]
  2. Let's take a look at total_cases_mill. We get the first and third quartile values and calculate 1.5 times the interquartile range, 1.5*(thirdq-firstq). Then, we use that amount to determine the thresholds for high and low extreme values, which are interquartilerange+thirdq and firstq-interquartilerange, respectively (if you are familiar with boxplots, you will notice that this is the same calculation that's used for the whiskers of a boxplot; we will cover boxplots in the next section):
    thirdq, firstq = covidkeys.total_cases_mill.quantile(0.75), \
      covidkeys.total_cases_mill.quantile(0.25)
    interquartilerange = 1.5*(thirdq-firstq)
    extvalhigh, extvallow = interquartilerange+thirdq, firstq-interquartilerange
    print(extvallow, extvalhigh, sep=" <--> ")
    -91002.564625 <--> 158336.930375
  3. This calculation indicates that any value for total_cases_mill that's above 158,337 can be considered extreme. We can ignore extreme values on the low end because they would be negative:
    covidtotals.loc[covidtotals.total_cases_mill>extvalhigh].T
    iso_code                   AND         MNE         SYC
    lastdate            2021-07-07  2021-07-07  2021-07-07
    location            Andorra     Montenegro  Seychelles
    total_cases          14,021        100,392      16,304
    total_deaths            127          1,619          71
    total_cases_mill    181,466        159,844     165,792
    total_deaths_mill     1,644          2,578         722
    population           77,265        628,062      98,340
    population_density      164             46         208
    median_age              NaN             39          36
    gdp_per_capita          NaN         16,409      26,382
    aged_65_older           NaN             15           9
    total_tests_thous       NaN            NaN         NaN
    life_expectancy          84             77          73
    hospital_beds_thous     NaN              4           4
    diabetes_prevalence       8             10          11
    region          Western Europe  Eastern Europe  East Africa
  4. Andorra, Montenegro, and Seychelles all have total_cases_mill above the threshold amount. This invites us to explore other ways these countries might be exceptional, and whether our features can capture that. We will not dive deeply into multivariate analysis here since we will do that in the next chapter, but it is a good idea to start wrapping our brains around why these extreme values may or may not make sense. Looking at the means across the full dataset may help us here:
    covidtotals.mean()
    total_cases               963,933
    total_deaths              21,631
    total_cases_mill          36,649
    total_deaths_mill         683
    population                35,134,957
    population_density        453
    median_age                30
    gdp_per_capita            19,141
    aged_65_older             9
    total_tests_thous         535
    life_expectancy           73
    hospital_beds_thous       3
    diabetes_prevalence       8 

The main difference between these three countries and others is that they have very low populations. Surprisingly, each has a much lower population density than average. That is the opposite of what you would expect and merits further consideration in the analysis we will do throughout this book.
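
To make that comparison explicit, we can pull those two features for the flagged countries and set them against the full-sample means (a quick sketch that reuses the extvalhigh threshold from step 2):

    # population figures for the three flagged countries
    covidtotals.loc[covidtotals.total_cases_mill>extvalhigh,
      ['population','population_density']]
    # full-sample means for the same features
    covidtotals[['population','population_density']].mean()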

An Alternative to the IQR Calculation

An alternative to using the interquartile range to identify an extreme value is to use a distance of several standard deviations from the mean, say 3. One drawback of this method is that it is more susceptible to the influence of extreme values than the interquartile range, since the mean and standard deviation are themselves calculated from those values.
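
As a rough sketch of that alternative (the threshold of 3 standard deviations is a common convention rather than a fixed rule):

    # flag values more than 3 standard deviations from the mean
    zscores = (covidkeys.total_cases_mill - \
      covidkeys.total_cases_mill.mean()) / \
      covidkeys.total_cases_mill.std()
    covidkeys.total_cases_mill[zscores.abs() > 3]

Because the mean and standard deviation are pulled around by the very values we are trying to flag, this approach can mask outliers that the interquartile range method would catch.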

I find it helpful to produce this kind of analysis for all the key targets and features in my data, so let's automate this method of identifying extreme values. We should also output the results to a file so that we can use them when we need them:

  1. Let's define a function, getextremevalues, that iterates over all of the columns of our DataFrame (except for the first one, which contains location), calculates the extreme value thresholds from the interquartile range for that column, selects all the observations with values above the high threshold or below the low threshold, and then appends the results to a new DataFrame (dfout):
    def getextremevalues(dfin):
        # empty DataFrame to collect the flagged observations
        dfout = pd.DataFrame(columns=dfin.columns, data=None)
        # skip the first column, which contains location
        for col in dfin.columns[1:]:
            thirdq, firstq = dfin[col].quantile(0.75), \
              dfin[col].quantile(0.25)
            interquartilerange = 1.5*(thirdq-firstq)
            extvalhigh, extvallow = \
              interquartilerange+thirdq, \
              firstq-interquartilerange
            # observations beyond either threshold for this column
            df = dfin.loc[(dfin[col]>extvalhigh) |
              (dfin[col]<extvallow)]
            df = df.assign(varname = col,
              threshlow = extvallow,
              threshhigh = extvalhigh)
            dfout = pd.concat([dfout, df])
        return dfout
  2. Now, we can pass our covidkeys DataFrame to the getextremevalues function to get a DataFrame that contains the extreme values for each column. Then, we can display the number of extreme values for each column, which tells us that there were four extreme values for total deaths per million people (total_deaths_mill). Finally, we can output the data to an Excel file:
    extremevalues = getextremevalues(covidkeys)
    extremevalues.varname.value_counts()
    gdp_per_capita         9
    diabetes_prevalence    8
    total_deaths_mill      4
    total_cases_mill       3
    Name: varname, dtype: int64
    extremevalues.to_excel("views/extremevaluescases.xlsx")
  3. Let's take a closer look at the extreme deaths per million values. We can query the DataFrame we just created to get the threshhigh value for total_deaths_mill, which is 2654. We can also get other key features for those countries with the extreme values since we have included that data in the new DataFrame:
    extremevalues.loc[extremevalues.varname=="total_deaths_mill",
        'threshhigh'].iloc[0]
    2653.752
    extremevalues.loc[extremevalues.varname=="total_deaths_mill",
          keyvars].sort_values(['total_deaths_mill'], ascending=False)
         location                total_cases_mill  total_deaths_mill  aged_65_older  diabetes_prevalence  gdp_per_capita
    PER  Peru                              62,830              5,876              7                    6          12,237
    HUN  Hungary                           83,676              3,105             19                    8          26,778
    BIH  Bosnia and Herzegovina            62,499              2,947             17                   10          11,714
    CZE  Czechia                          155,783              2,830             19                    7          32,606

Peru, Hungary, Bosnia and Herzegovina, and Czechia have total_deaths_mill above the extreme value threshold. One thing that stands out for three of these countries is how far above the average percentage of population aged 65 or older they are (the average for that feature, as we displayed in a preceding table, is 9). Although these are extreme values for deaths, the relationship between the elderly percentage of the population and COVID deaths may account for much of this, without our having to overfit the model to these extreme cases. We will go through some strategies for teasing that out in the next chapter.
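
As a quick, informal peek at that relationship across all countries (we will examine bivariate relationships properly in the next chapter):

    covidkeys[['total_deaths_mill','aged_65_older']].corr()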

So far, we have discussed outliers and extreme values without referencing distribution shape. What we have implied so far is that an extreme value is a rare value – significantly rarer than the values near the center of the distribution. But this makes the most sense when the feature's distribution approaches normal. If, on the other hand, a feature had a uniform distribution, a very high value would be no rarer than any other value.

In practice, then, we think about extreme values or outliers relative to the distribution of the feature. Quantile-quantile (Q-Q) plots can improve our sense of that distribution by allowing us to view it graphically relative to a theoretical distribution: normal, uniform, log, or others. Let's take a look:

  1. Let's create a Q-Q plot of total cases per million that's relative to the normal distribution. We can use the statsmodels library for this:
    sm.qqplot(covidtotals['total_cases_mill']. \
      sort_values().dropna(), line='s')
    plt.show()

This produces the following plot:

Figure 1.1 – Q-Q plot of total cases per million

This Q-Q plot makes it clear that the distribution of total cases across countries is not normal. We can see this in how much the data points deviate from the red line. It is the kind of Q-Q plot we would expect from a distribution with some positive skew. This is consistent with the summary statistics we have already calculated for the total cases feature. It further reinforces our developing sense that we will need to be cautious about parametric models and that we will probably have to account for more than just one or two outlier observations.
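
As an aside, qqplot can also compare against theoretical distributions other than the normal through its dist parameter, which accepts a scipy.stats distribution. Here is a minimal sketch comparing total cases against a log-normal distribution, which often suits positively skewed data (fit=True asks statsmodels to estimate the distribution's parameters from the data):

    import scipy.stats as stats
    sm.qqplot(covidtotals.total_cases_mill.dropna(),
      dist=stats.lognorm, fit=True, line='45')
    plt.show()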

Let's look at a Q-Q plot for a feature with a distribution that is a little closer to normal. There isn't a great candidate in the COVID data, so we will work with data from the United States National Oceanic and Atmospheric Administration on land temperatures in 2019.

Data Note

The land temperature dataset contains the average temperature readings (in Celsius) in 2019 from over 12,000 stations across the world, though the majority of the stations are in the United States. The raw data was retrieved from the Global Historical Climatology Network integrated database. It has been made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-2.

  1. First, let's load the data into a pandas DataFrame and run some descriptive statistics on the temperature feature, avgtemp. We add a few percentile statistics to the default describe output to get a better sense of the range of values. avgtemp is the average temperature for the year at each of the 12,095 weather stations. The average temperature across all stations was 11.2 degrees Celsius. The median was 10.4. However, there were some very negative values, including 14 weather stations with an average temperature of less than -25. This contributes to a moderately negative skew, though both the skew and kurtosis are closer to what we would expect from a normal distribution than they were for the COVID case data:
    landtemps = pd.read_csv("data/landtemps2019avgs.csv")
    landtemps.avgtemp.describe(percentiles=[0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95])
    count        12,095.0
    mean             11.2
    std               8.6
    min             -60.8
    5%               -0.7
    10%               1.7
    25%               5.4
    50%              10.4
    75%              16.9
    90%              23.1
    95%              27.0
    max              33.9
    Name: avgtemp, dtype: float64
    landtemps.loc[landtemps.avgtemp<-25,'avgtemp'].count()
    14
    landtemps.avgtemp.skew()
    -0.2678382583481768
    landtemps.avgtemp.kurtosis()
    2.1698313707061073
  2. Now, let's take a look at a Q-Q plot of the average temperature:
    sm.qqplot(landtemps.avgtemp.sort_values().dropna(), line='s')
    plt.tight_layout()
    plt.show()

This produces the following plot:

Figure 1.2 – Q-Q plot of average temperatures

Along most of the range, the distribution of average temperatures looks pretty close to normal. The exceptions are the extremely low temperatures, contributing to a small amount of negative skew. There is also some deviation from normal at the high end, though this is much less of an issue (you may have noticed that Q-Q plots for features with negative skew have an umbrella-like shape, while those with positive skews, such as total cases, have more of a bowl-like shape).
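
If we want a formal complement to the visual impression, one option (not used elsewhere in this chapter) is D'Agostino and Pearson's normality test from scipy, which is based on skew and kurtosis:

    import scipy.stats as stats
    # returns a test statistic and a p-value; a small p-value
    # suggests the data did not come from a normal distribution
    stats.normaltest(landtemps.avgtemp.dropna())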

We are off to a good start in our efforts to understand the distribution of possible features and targets and, in a related effort, to identify extreme values and outliers. This is important information to have at our fingertips when we construct, refine, and interpret our models. But there is more that we can do to improve our intuition about the data. A good next step is to construct visualizations of key features.
