Identifying extreme values and outliers in univariate analysis
An outlier can be thought of as an observation with feature values, or relationships between feature values, that are so unusual that they cannot help explain relationships in the rest of the data. This matters for modeling because we cannot assume that the outliers will have a neutral impact on our parameter estimates. Sometimes, our models work so hard to construct parameter estimates that can account for patterns in outlier observations that we compromise the model's explanatory or predictive power for all other observations. Raise your hand if you have ever spent days trying to interpret a model only to discover that your coefficients and predictions completely changed once you removed a few outliers.
I should quickly add that there is no agreed-upon definition of an outlier. I offer the preceding definition for use in this book because it helps us distinguish between outliers, as I have described them, and extreme values. There is a fair bit of overlap between the two, but many extreme values are not outliers. This is because such values reflect a natural and explainable trend in a feature, or because they reflect the same relationship between features as is observed throughout the data. The reverse is also true. Some outliers are not extreme values. For example, a target value might be right in the middle of the distribution but have quite unexpected predictor values.
For our modeling, then, it is hard to say that a particular feature or target value is an outlier without referencing multivariate relationships. But it should at least raise a red flag when, in our univariate analysis, we see values well to the left or right of the center. This should prompt us to investigate the observation at that value further, including examining the values of other features. We will look at multivariate relationships in more detail in the next two chapters. Here, and in the next section on visualizations, we will focus on identifying extreme values and outliers when looking at one variable.
A good starting point for identifying an extreme value is to look at its distance from the middle of the distribution. One common method is based on the interquartile range (IQR), which is the distance between the first quartile value and the third quartile value. We often flag any value that is more than 1.5 times the IQR above the third quartile or below the first quartile. We can use this method to identify outliers in the COVID-19 data.
Let's get started:
- Let's start by importing the libraries we will need. In addition to pandas and NumPy, we will use Matplotlib and statsmodels for the plots we will create. We will also load the COVID data and select the variables we need:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

covidtotals = pd.read_csv("data/covidtotals.csv")
covidtotals.set_index("iso_code", inplace=True)
keyvars = ['location','total_cases_mill','total_deaths_mill',
  'aged_65_older','diabetes_prevalence','gdp_per_capita']
covidkeys = covidtotals[keyvars]
```
- Let's take a look at `total_cases_mill`. We get the first and third quartile values, then compute 1.5 times the interquartile range, `1.5*(thirdq-firstq)`. The thresholds for the high and low extreme values are that amount above the third quartile and below the first quartile, respectively. (If you are familiar with boxplots, you will notice that this is the same calculation that's used for the whiskers of a boxplot; we will cover boxplots in the next section.)

```python
thirdq, firstq = covidkeys.total_cases_mill.quantile(0.75), \
  covidkeys.total_cases_mill.quantile(0.25)
# note: this variable holds 1.5 times the IQR, the fence distance
interquartilerange = 1.5*(thirdq-firstq)
extvalhigh, extvallow = interquartilerange+thirdq, firstq-interquartilerange
print(extvallow, extvalhigh, sep=" <--> ")
```

```
-91002.564625 <--> 158336.930375
```
- This calculation indicates that any value for `total_cases_mill` above 158,337 can be considered extreme. We can ignore extreme values on the low end because they would be negative:

```python
covidtotals.loc[covidtotals.total_cases_mill>extvalhigh].T
```

```
iso_code                        AND             MNE          SYC
lastdate                 2021-07-07      2021-07-07   2021-07-07
location                    Andorra      Montenegro   Seychelles
total_cases                  14,021         100,392       16,304
total_deaths                    127           1,619           71
total_cases_mill            181,466         159,844      165,792
total_deaths_mill             1,644           2,578          722
population                   77,265         628,062       98,340
population_density              164              46          208
median_age                      NaN              39           36
gdp_per_capita                  NaN          16,409       26,382
aged_65_older                   NaN              15            9
total_tests_thous               NaN             NaN          NaN
life_expectancy                  84              77           73
hospital_beds_thous             NaN               4            4
diabetes_prevalence               8              10           11
region               Western Europe  Eastern Europe  East Africa
```
- Andorra, Montenegro, and Seychelles all have `total_cases_mill` above the threshold. This invites us to explore other ways in which these countries might be exceptional, and whether our features can capture that. We will not dive deeply into multivariate analysis here, since we will do that in the next chapter, but it is a good idea to start wrapping our brains around why these extreme values may or may not make sense. Having the means across the full dataset may help us here:

```python
covidtotals.mean()
```

```
total_cases              963,933
total_deaths              21,631
total_cases_mill          36,649
total_deaths_mill            683
population            35,134,957
population_density           453
median_age                    30
gdp_per_capita            19,141
aged_65_older                  9
total_tests_thous            535
life_expectancy               73
hospital_beds_thous            3
diabetes_prevalence            8
```
The main difference between these three countries and the others is their very low populations. Each also has a much lower population density than average, which is the opposite of what you might expect for countries with high case rates and merits further consideration in the analysis we will do throughout this book.
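To make that comparison concrete, we can put the three flagged countries' population figures side by side with the dataset means. This is a quick sketch assuming the `covidtotals` DataFrame loaded earlier, with the `iso_code` index values taken from the table above:

```python
# compare the three flagged countries to the dataset averages
flagged = covidtotals.loc[['AND','MNE','SYC'],
  ['population','population_density']]
print(flagged)
print(covidtotals[['population','population_density']].mean())
```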
An Alternative to the IQR Calculation
An alternative to using the interquartile range to identify an extreme value is to use distance from the mean in standard deviations, say 3 or more. One drawback of this method is that the mean and standard deviation are themselves more sensitive to extreme values than the quartiles are, so the very outliers we are trying to detect can pull the thresholds outward.
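Here is a minimal sketch of that approach for `total_cases_mill`, assuming the `covidkeys` DataFrame defined earlier; we flag any value more than three standard deviations from the mean:

```python
# flag values more than 3 standard deviations from the mean
meanval = covidkeys.total_cases_mill.mean()
stdval = covidkeys.total_cases_mill.std()
zhigh, zlow = meanval + 3*stdval, meanval - 3*stdval
covidkeys.loc[(covidkeys.total_cases_mill > zhigh) |
  (covidkeys.total_cases_mill < zlow)]
```

Because any flagged outliers inflate both the mean and the standard deviation, the cutoffs drift outward, which is the susceptibility mentioned above.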
I find it helpful to produce this kind of analysis for all the key targets and features in my data, so let's automate this method of identifying extreme values. We should also output the results to a file so that we can use them when we need them:
- Let's define a function, `getextremevalues`, that iterates over all of the columns of our DataFrame (except for the first one, which contains the location column), calculates the interquartile range thresholds for that column, selects all the observations with values above the high threshold or below the low threshold, and then appends the results to a new DataFrame (`dfout`):

```python
def getextremevalues(dfin):
    dfout = pd.DataFrame(columns=dfin.columns, data=None)
    for col in dfin.columns[1:]:
        thirdq, firstq = dfin[col].quantile(0.75), \
            dfin[col].quantile(0.25)
        interquartilerange = 1.5*(thirdq-firstq)
        extvalhigh, extvallow = \
            interquartilerange+thirdq, \
            firstq-interquartilerange
        df = dfin.loc[(dfin[col]>extvalhigh) | (dfin[col]<extvallow)]
        df = df.assign(varname = col, threshlow = extvallow,
                       threshhigh = extvalhigh)
        dfout = pd.concat([dfout, df])
    return dfout
```
- Now, we can pass our `covidkeys` DataFrame to the `getextremevalues` function to get a DataFrame that contains the extreme values for each column. Then, we can display the number of extreme values for each column, which tells us that there were four extreme values for total deaths per million people in the population (`total_deaths_mill`). Finally, we can output the data to an Excel file:

```python
extremevalues = getextremevalues(covidkeys)
extremevalues.varname.value_counts()
```

```
gdp_per_capita         9
diabetes_prevalence    8
total_deaths_mill      4
total_cases_mill       3
Name: varname, dtype: int64
```

```python
extremevalues.to_excel("views/extremevaluescases.xlsx")
```
- Let's take a closer look at the extreme deaths per million values. We can query the DataFrame we just created to get the `threshhigh` value for `total_deaths_mill`, which is 2,654. We can also get other key features for those countries with the extreme values, since we included that data in the new DataFrame:

```python
extremevalues.loc[extremevalues.varname=="total_deaths_mill",
  'threshhigh'].iloc[0]
```

```
2653.752
```

```python
extremevalues.loc[extremevalues.varname=="total_deaths_mill",
  keyvars].sort_values(['total_deaths_mill'], ascending=False)
```

```
                   location  total_cases_mill  total_deaths_mill
PER                    Peru            62,830              5,876
HUN                 Hungary            83,676              3,105
BIH  Bosnia and Herzegovina            62,499              2,947
CZE                 Czechia           155,783              2,830

     aged_65_older  diabetes_prevalence  gdp_per_capita
PER              7                    6          12,237
HUN             19                    8          26,778
BIH             17                   10          11,714
CZE             19                    7          32,606
```
Peru, Hungary, Bosnia and Herzegovina, and Czechia have `total_deaths_mill` above the extreme value threshold. One thing that stands out is how far above the average percent of population 65 or older three of these countries are (the average for that feature, as we displayed in a preceding table, is 9). Although these are extreme values for deaths, the relationship between the elderly percentage of the population and COVID deaths may account for much of this, and can do so without overfitting the model to these extreme cases. We will go through some strategies for teasing that out in the next chapter.
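As a rough first check of that hypothesis, we can look at the pairwise correlation across the full dataset. This is a sketch using the `covidkeys` DataFrame from earlier, not a substitute for the multivariate work to come:

```python
# correlation between percent of population 65+ and deaths per million
covidkeys[['aged_65_older','total_deaths_mill']].corr()
```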
So far, we have discussed outliers and extreme values without referencing distribution shape. What we have implied so far is that an extreme value is a rare value – significantly rarer than the values near the center of the distribution. But this makes the most sense when the feature's distribution approaches normal. If, on the other hand, a feature had a uniform distribution, a very high value would be no rarer than any other value.
In practice, then, we think about extreme values or outliers relative to the distribution of the feature. Quantile-quantile (Q-Q) plots can improve our sense of that distribution by allowing us to view it graphically relative to a theoretical distribution: normal, uniform, log, or others. Let's take a look:
- Let's create a Q-Q plot of total cases per million relative to the normal distribution. We can use the `statsmodels` library for this:

```python
# qqplot expects a one-dimensional array or Series;
# line='s' draws a line fit to the standardized data
sm.qqplot(covidtotals.total_cases_mill.dropna(), line='s')
plt.show()
```
This produces the following plot:
This Q-Q plot makes it clear that the distribution of total cases across countries is not normal; we can see this in how far the data points deviate from the red line. It is the Q-Q plot we would expect from a distribution with some positive skew, which is consistent with the summary statistics we have already calculated for the total cases feature. It further reinforces our developing sense that we will need to be cautious about parametric models and that we will probably have to account for more than just one or two outlier observations.
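If you want a numerical companion to the plot, the skew and kurtosis for total cases can be computed directly. This is a quick sketch; the exact values depend on your copy of the data:

```python
# positive skew shows up as a bowl-like curve in the Q-Q plot
print(covidtotals.total_cases_mill.skew())
print(covidtotals.total_cases_mill.kurtosis())
```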
Let's look at a Q-Q plot for a feature with a distribution that is a little closer to normal. There isn't a great candidate in the COVID data, so we will work with data from the United States National Oceanic and Atmospheric Administration on land temperatures in 2019.
Data Note
The land temperature dataset contains the average temperature readings (in Celsius) in 2019 from over 12,000 stations across the world, though the majority of the stations are in the United States. The raw data was retrieved from the Global Historical Climatology Network integrated database. It has been made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-2.
- First, let's load the data into a pandas DataFrame and run some descriptive statistics on the temperature feature, `avgtemp`. We add a few percentile statistics to the default `describe` output to get a better sense of the range of values. `avgtemp` is the average temperature for the year at each of the 12,095 weather stations. The average temperature across all stations was 11.2 degrees Celsius, and the median was 10.4. However, there were some very low values, including 14 weather stations with an average temperature below -25. This contributes to a moderately negative skew, though both the skew and kurtosis are closer to what would be expected from a normal distribution:

```python
landtemps = pd.read_csv("data/landtemps2019avgs.csv")
landtemps.avgtemp.describe(percentiles=[0.05, 0.1, 0.25, 0.5,
  0.75, 0.9, 0.95])
```

```
count   12,095.0
mean        11.2
std          8.6
min        -60.8
5%          -0.7
10%          1.7
25%          5.4
50%         10.4
75%         16.9
90%         23.1
95%         27.0
max         33.9
Name: avgtemp, dtype: float64
```

```python
landtemps.loc[landtemps.avgtemp<-25,'avgtemp'].count()
```

```
14
```

```python
landtemps.avgtemp.skew()
```

```
-0.2678382583481768
```

```python
landtemps.avgtemp.kurtosis()
```

```
2.1698313707061073
```
- Now, let's take a look at a Q-Q plot of the average temperature:
```python
sm.qqplot(landtemps.avgtemp.dropna(), line='s')
plt.tight_layout()
plt.show()
```
This produces the following plot:
Along most of the range, the distribution of average temperatures looks pretty close to normal. The exceptions are the extremely low temperatures, contributing to a small amount of negative skew. There is also some deviation from normal at the high end, though this is much less of an issue (you may have noticed that Q-Q plots for features with negative skew have an umbrella-like shape, while those with positive skews, such as total cases, have more of a bowl-like shape).
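To see the umbrella and bowl shapes on data whose skew we control, we can generate skewed samples with NumPy. This is a synthetic sketch, not part of the chapter's datasets:

```python
# gamma draws are right-skewed (bowl-like Q-Q curve);
# negating them produces left skew (umbrella-like curve)
rng = np.random.default_rng(0)
posskew = pd.Series(rng.gamma(2, 2, 1000))
negskew = -posskew
sm.qqplot(posskew, line='s')
sm.qqplot(negskew, line='s')
plt.show()
```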
We are off to a good start in our efforts to understand the distribution of possible features and targets and, in a related effort, to identify extreme values and outliers. This is important information to have at our fingertips when we construct, refine, and interpret our models. But there is more we can do to improve our intuition about the data. A good next step is to construct visualizations of key features.