You're reading from Essential Statistics for Non-STEM Data Analysts Get to grips with the statistics and math knowledge needed to enter the world of data science with Python

Product type Paperback

Published in Nov 2020

Publisher Packt

ISBN-13 9781838984847

Length 392 pages

Edition 1st Edition

Languages

Python

Concepts

Data Science

Author (1):

Rongpeng Li

Table of Contents (19) Chapters

Preface

1. Section 1: Getting Started with Statistics for Data Science

2. Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing FREE CHAPTER

3. Chapter 2: Essential Statistics for Data Assessment

4. Chapter 3: Visualization with Statistical Graphs

5. Section 2: Essentials of Statistical Analysis

6. Chapter 4: Sampling and Inferential Statistics

7. Chapter 5: Common Probability Distributions

8. Chapter 6: Parametric Estimation

9. Chapter 7: Statistical Hypothesis Testing

10. Section 3: Statistics for Machine Learning

11. Chapter 8: Statistics for Regression

12. Chapter 9: Statistics for Classification

13. Chapter 10: Statistics for Tree-Based Methods

14. Chapter 11: Statistics for Ensemble Methods

15. Section 4: Appendix

16. Chapter 12: A Collection of Best Practices

17. Chapter 13: Exercises and Projects

18. Other Books You May Enjoy

Learning about variance, standard deviation, quartiles, percentiles, and skewness

In the previous section, we studied the mean, median, and mode. They all describe, to a certain degree, the properties of the central part of the dataset. In this section, we will learn how to describe the spreading behavior of data.

Variance

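With the same notation, variance for the population is defined as follows:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

Here, $x_i$ are the elements of the population, $N$ is their number, and $\mu$ is their mean.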

Intuitively, the further away the elements are from the mean, the larger the variance. Here, I plotted the histograms of two datasets with different variances. The one in the left subplot has a variance of 0.09 (a standard deviation of 0.3), and the one in the right subplot has a variance of 0.01 (a standard deviation of 0.1), nine times smaller.

The following code snippet generates samples from the two distributions and plots them:

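import random
import matplotlib.pyplot as plt

# Two normal samples with the same mean (0.5) but different standard deviations
r1 = [random.normalvariate(0.5,0.3) for _ in range(10000)]
r2 = [random.normalvariate(0.5,0.1) for _ in range(10000)]
fig, axes = plt.subplots(1,2,figsize=(12,5))
axes[0].hist(r1,bins=100)
axes[0].set_xlim([-1,2])  # same x range on both subplots for a fair comparison
axes[0].set_title("Big Variance")
axes[1].hist(r2,bins=100)
axes[1].set_title("Small Variance")
axes[1].set_xlim([-1,2]);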

The results appear as follows:

Figure 2.6 – Big and small variances with the same mean at 0.5
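
As a quick numerical check on the definition of population variance, here is a minimal sketch that computes it by hand for two samples drawn with the same parameters as in the plot. The helper name population_variance and the seed value are illustrative choices, not part of the original example.

```python
import random
import statistics

def population_variance(values):
    """Mean of squared deviations from the mean (divides by N, not N - 1)."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

random.seed(0)  # arbitrary seed, used only for reproducibility
wide = [random.normalvariate(0.5, 0.3) for _ in range(10000)]
narrow = [random.normalvariate(0.5, 0.1) for _ in range(10000)]

var_wide = population_variance(wide)
var_narrow = population_variance(narrow)

# The hand-written helper should agree with the standard library
check_wide = statistics.pvariance(wide)

print(var_wide, var_narrow)  # close to 0.3**2 and 0.1**2, respectively
```

Because the samples are random, the computed values only approximate the squared standard deviations passed to random.normalvariate().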

The following code snippet generates a scatter plot that will demonstrate the difference more clearly. The variable on the x axis spreads more widely:

plt.figure(figsize=(8,8))
plt.scatter(r1,r2,alpha=0.2)
plt.xlim(-1,2)
plt.ylim(-1,2)
plt.xlabel("Big Variance Variable")
plt.ylabel("Small Variance Variable")
plt.title("Variables With Different Variances");

The result looks as follows:

Figure 2.7 – Scatter plot of large-variance and small-variance variables

The spread along the x axis is significantly larger than the spread along the y axis, which reflects the difference in variance. A common mistake is to leave the axis ranges unset. By default, Matplotlib chooses the range of each axis automatically from the data, so both variables would fill the plot and appear equally spread. You need to set the ranges explicitly with code such as plt.xlim() and plt.ylim(); otherwise, the result is misleading.

Standard deviation

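Standard deviation is the square root of the variance. It is used more commonly to measure the level of dispersion since it has the same unit as the original data. The formula for the standard deviation of a population reads as follows:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

where $\mu$ is the population mean and $N$ is the population size.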

Standard deviation is extremely important in scientific graphing. A standard deviation is often plotted together with the data and represents an estimate of variability.
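
The relationship between variance and standard deviation can be checked on a small, made-up dataset. This is only a sketch: the eight values are invented for illustration. Note that np.var() and np.std() use the population formulas by default (ddof=0); passing ddof=1 switches to the sample version, which divides by N - 1.

```python
import math
import numpy as np

# Hypothetical values chosen so that the arithmetic is easy to follow
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

variance = np.var(data)           # population variance (ddof=0 by default)
std = np.std(data)                # population standard deviation
sample_std = np.std(data, ddof=1) # sample standard deviation (divides by N - 1)

print(variance, std, sample_std)
print(math.isclose(std, math.sqrt(variance)))  # std is the square root of the variance
```

For this dataset, the mean is 5, the squared deviations sum to 32, and so the population variance is 4 and the population standard deviation is 2.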

For this chapter, I will be using the net immigration rate for Texas from 2011 to 2018 as an example. In the following code snippet, I will first extract the county-level data, append the means and standard deviations to a list, and then plot them at the end. The standard deviation is obtained using numpy.std() and the error bar is plotted using matplotlib.pyplot.errorbar():

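import numpy as np
import matplotlib.pyplot as plt

# df is assumed to be the DataFrame loaded earlier in the book.
# tail(-1) drops the first matching row (presumably the state-level summary),
# leaving the county-level rows.
dfTX = df[df["State"]=="TX"].tail(-1)
YEARS = list(range(2011,2019))
MEANS = []
STDS = []
for year in YEARS:
    column = "R_NET_MIG_"+str(year)
    MEANS.append(np.mean(dfTX[column]))
    STDS.append(np.std(dfTX[column]))
plt.figure(figsize=(10,8))
# Each error bar extends one standard deviation above and below the mean
plt.errorbar(YEARS,MEANS,yerr=STDS)
plt.xlabel("Year")
plt.ylabel("Net Immigration Rate");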

The output appears as shown in the following figure:

Figure 2.8 – Net immigration rate across counties in Texas from 2011 to 2018

We can see in Figure 2.8 that although the net immigration in Texas is only slightly positive, the standard deviation is huge. Some counties may have a big positive net rate, while others may potentially suffer from the loss of human resources.

Quartiles

Quantiles are cut points that divide data into a number of equal portions, and quartiles are a special kind of quantile. Quartiles divide data into four equal parts, with the second quartile (the 0.5 quantile) as the median. Deciles and percentiles divide data into 10 and 100 equal parts, respectively.

The first quartile, $Q_1$, also known as the lower quartile, takes the value such that 25% of all the data lies below it. The second quartile, $Q_2$, is the median. The third quartile, $Q_3$, is also known as the upper quartile, and 25% of all values lie above it.

Quartiles are probably the most commonly used quantiles because they are associated with a statistical graph called a boxplot. Let's use the same set of Texas net immigration data to study it.

NumPy provides the quantile() function. We pass it the data and a list of the quantiles we want to calculate, as in the following single-line code snippet:

np.quantile(dfTX["R_NET_MIG_2013"],[0.25,0.5,0.75])

The output reads as follows:

array([-7.83469971,  0.87919226,  8.84040759])
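
np.quantile() is not limited to quartiles. The following sketch, on a made-up array of the integers 1 to 9, shows that np.percentile() returns the same values when the quantiles are expressed on a 0 to 100 scale, and that deciles are obtained in the same way. The array and variable names are illustrative only.

```python
import numpy as np

data = np.arange(1, 10)  # hypothetical data: 1, 2, ..., 9

quartiles = np.quantile(data, [0.25, 0.5, 0.75])
same_as_percentiles = np.percentile(data, [25, 50, 75])  # same cut points, 0-100 scale
deciles = np.quantile(data, np.linspace(0.1, 0.9, 9))    # nine cut points, ten parts

print(quartiles)            # [3. 5. 7.]
print(same_as_percentiles)  # [3. 5. 7.]
print(deciles)
```

Both functions interpolate linearly between neighboring data points by default, which is why a quartile does not have to coincide with an actual data value.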

The following code snippet visualizes the quartiles:

plt.figure(figsize=(12,5))
plt.hist(dfTX["R_NET_MIG_2013"],bins=50,alpha=0.6)
# Draw one vertical dotted line per quartile
for quartile in np.quantile(dfTX["R_NET_MIG_2013"],[0.25,0.5,0.75]):
    plt.axvline(quartile,linestyle=':',linewidth=4)

As you can see from the following output, the vertical dotted lines indicate the three quartiles:

Figure 2.9 – Quartiles of the net immigration data in 2013

The lower and upper quartiles keep 50% of the data values in between. The difference between the upper and lower quartiles, $Q_3 - Q_1$, is referred to as the Interquartile Range (IQR), and it plays an important role in outlier detection. We will see more about this soon.
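
A minimal sketch of the IQR computation follows. It uses synthetic, normally distributed values rather than the Texas data, and the seed and variable names are assumptions made for illustration.

```python
import random
import numpy as np

random.seed(1)  # arbitrary seed, used only for reproducibility
values = np.array([random.normalvariate(0, 1) for _ in range(10000)])

q1, q3 = np.quantile(values, [0.25, 0.75])
iqr = q3 - q1  # interquartile range: upper quartile minus lower quartile

# Fraction of the data lying between the lower and upper quartiles
inside = np.mean((values >= q1) & (values <= q3))

print(q1, q3, iqr, inside)
```

The fraction printed at the end should be very close to 0.5, which is what it means for the two quartiles to enclose the middle half of the data.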

Skewness

Skewness differs from the three measures of variability we discussed in the previous subsections. It measures the direction in which the data distribution tilts and the extent of that tilt. Skewness is given as shown in the following equation:

$$\text{skewness} = \frac{\mu - \text{mode}}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

Various definitions of skewness

The skewness we defined earlier is, more precisely, Pearson's first skewness coefficient. It is defined through the mode, but there are other definitions of skewness. For example, skewness can be defined through the median.

Skewness is unitless. If the mean is larger than the mode, skewness is positive, and we say the data is skewed to the right. Otherwise, the data is skewed to the left.
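
To make the sign convention concrete, here is a small sketch of Pearson's first skewness coefficient, computed as the mean minus the mode, divided by the population standard deviation. The function name pearson_first_skewness and both datasets are made up for illustration. The mode is only well defined here because each dataset has a single most frequent value.

```python
import statistics

def pearson_first_skewness(values):
    """(mean - mode) / population standard deviation."""
    mean = statistics.mean(values)
    mode = statistics.mode(values)
    return (mean - mode) / statistics.pstdev(values)

# Hypothetical data: the second list mirrors the first around 5
right_skewed = [1, 2, 2, 2, 3, 4, 5, 9]  # long tail toward large values
left_skewed = [1, 5, 6, 7, 8, 8, 8, 9]   # long tail toward small values

skew_right = pearson_first_skewness(right_skewed)
skew_left = pearson_first_skewness(left_skewed)

print(skew_right)  # positive: the mean (3.5) is larger than the mode (2)
print(skew_left)   # negative: the mean (6.5) is smaller than the mode (8)
```

For continuous data such as the samples plotted in this section, the mode has to be estimated first, for example from the tallest histogram bin, so this helper is best suited to discrete values.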

Here is the code snippet that generates two sets of skewed data and plots them:

# Three normal samples; concatenating two of them gives an asymmetric mixture
r1 = [random.normalvariate(0.5,0.4) for _ in range(10000)]
r2 = [random.normalvariate(0.1,0.2) for _ in range(10000)]
r3 = [random.normalvariate(1.1,0.2) for _ in range(10000)]
fig, axes = plt.subplots(1,2,figsize=(12,5))
# r1+r2 concatenates the two lists: the narrow peak near 0.1 sits left of the mean
axes[0].hist(r1+r2,bins=100,alpha=0.5)
axes[0].axvline(np.mean(r1+r2), linestyle=':',linewidth=4)
axes[0].set_title("Skewed To Right")
# r1+r3: the narrow peak near 1.1 sits right of the mean
axes[1].hist(r1+r3,bins=100,alpha=0.5)
axes[1].axvline(np.mean(r1+r3),linestyle=':',linewidth=4)
axes[1].set_title("Skewed to Left");

The vertical dotted line indicates the position of the mean as follows:

Figure 2.10 – Skewness demonstration

Think about the problem of income inequality. Let's say you have a histogram of the population by income, where the x axis value indicates the amount of income and the y axis value indicates the portion of the population that falls into a certain income range. A larger x value means more income. A larger y value means that a greater percentage of the population falls into that income range. Positive skewness (the left subplot in Figure 2.10) means that even though the average income looks good, it may be driven up by a very small number of super rich individuals while the majority of people earn a relatively small income. Negative skewness (the right subplot in Figure 2.10) indicates that the majority may have an income above the mean value, so there might be some very poor people who may need help.

A revisit of outlier detection

Now, let's use what we have learned to revisit the outlier detection problem.

The z-score, also known as the standard score, is a good criterion for detecting outliers. It measures the distance between an entry and the population mean, taking the population variance into consideration:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the entry, $\mu$ is the population mean, and $\sigma$ is the population standard deviation.

If the underlying distribution is normal, a z-score greater than 3 or less than -3 occurs with a probability of only roughly 0.27%. Even if the underlying distribution is not normal, Chebyshev's theorem guarantees that at most $1/k^2$ of the total population can fall more than $k$ standard deviations away from the mean, where $k$ is any number greater than 1 (integers are the usual choice).
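
The two bounds can be compared side by side. The sketch below computes the probability that a normally distributed value lies more than k standard deviations from the mean, using the complementary error function from the standard library, and sets it against the Chebyshev bound. The function names are illustrative.

```python
import math

def normal_two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))

def chebyshev_bound(k):
    """Upper bound on the fraction of any distribution beyond k standard deviations."""
    return 1 / k ** 2

for k in (1, 2, 3):
    print(k, normal_two_sided_tail(k), chebyshev_bound(k))
```

For k = 3, the normal tail probability is about 0.0027, while Chebyshev's bound is 1/9, roughly 0.11. The bound is much looser because it has to hold for every distribution with a finite variance, not only for the normal one.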

As an example, the following code snippet generates 10,000 data points that follow a normal distribution:

random.seed(2020)
x = [random.normalvariate(1, 0.5) for _ in range(10000)]
plt.figure(figsize=(10,8))
plt.hist(x,bins=100,alpha=0.5);
styles = [":","--","-."]
for i in range(3):
    plt.axvline(np.mean(x) + (i+1)*np.std(x),
                linestyle=styles[i],
                linewidth=4)
    plt.axvline(np.mean(x) - (i+1)*np.std(x),
                linestyle=styles[i],
                linewidth=4)
plt.title("Integer Z values for symmetric distributions");

In the generated histogram plot, the dotted lines indicate the locations where $z = \pm 1$. The dashed lines indicate the locations of $z = \pm 2$. The dash-dotted lines indicate the locations of $z = \pm 3$:

Figure 2.11 – Integer z value boundaries for normally distributed symmetric data

If we change the data points, the distribution will change, but the z-score criteria will remain valid. As you can see in the following code snippet, an asymmetric distribution is generated rather than a normal distribution:

x = [random.normalvariate(1, 0.5) + random.expovariate(2) for _ in range(10000)]

This produces the following output:

Figure 2.12 – Integer z value boundaries for asymmetric data
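
As a sketch of how the z-score criterion could be applied to such asymmetric data, the following code computes a z-score for every point and flags those more than three standard deviations from the mean. The threshold of 3 and the variable names are assumptions for illustration; in practice, the cutoff depends on the use case.

```python
import random
import statistics

random.seed(2020)
x = [random.normalvariate(1, 0.5) + random.expovariate(2) for _ in range(10000)]

mu = statistics.fmean(x)
sigma = statistics.pstdev(x)  # population standard deviation

z_scores = [(v - mu) / sigma for v in x]

# Flag entries whose z-score magnitude exceeds the (assumed) threshold of 3
outliers = [v for v, z in zip(x, z_scores) if abs(z) > 3]
fraction = len(outliers) / len(x)

print(len(outliers), fraction)
```

Whatever the shape of the distribution, Chebyshev's theorem guarantees that the flagged fraction cannot exceed 1/9.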

Note on the influence of extreme outliers

A drawback of the z-score is that the mean itself is also influenced by extreme outliers. The median can replace the mean to remove this effect. You are free to set different criteria for different production cases.
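
The effect described in this note can be seen on a tiny, made-up dataset with one extreme value. The sketch below is one possible reading of the note: it swaps the mean for the median as the center and keeps the standard deviation unchanged, even though the standard deviation is itself inflated by the outlier.

```python
import statistics

data = [10, 11, 12, 13, 14, 1000]  # hypothetical data with one extreme value

mean = statistics.mean(data)      # pulled far above the typical values
median = statistics.median(data)  # stays among the typical values
sigma = statistics.pstdev(data)

mean_scores = [(v - mean) / sigma for v in data]
median_scores = [(v - median) / sigma for v in data]  # median-centered variant

print(mean, median)
print(mean_scores[-1], median_scores[-1])
```

The median-centered score of the extreme value is larger than its mean-centered score, because the median has not been dragged toward the outlier.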

We have covered several of the most important statistics for describing the variability in a dataset. In the next section, let's work on the data types of features.
