Learning about variance, standard deviation, quartiles, percentiles, and skewness
In the previous section, we studied the mean, median, and mode. They all describe, to a certain degree, the properties of the central part of the dataset. In this section, we will learn how to describe the spreading behavior of data.
Variance
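With the same notation, the variance of a population is defined as follows:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
Here, $\mu$ is the population mean and $N$ is the number of elements in the population.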
Intuitively, the further away the elements are from the mean, the larger the variance. Here, I plotted the histograms of two datasets with different variances. The one on the left subplot has a variance of 0.09 and the one on the right subplot has a variance of 0.01, nine times smaller.
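As a quick numeric check of the definition, the following minimal sketch computes the population variance by hand and compares it with statistics.pvariance() from the Python standard library. The helper name population_variance and the toy data are illustrative assumptions, not part of the original example:

```python
import statistics

def population_variance(values):
    """Average squared distance from the mean (divides by N, not N - 1)."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

# Two toy datasets with the same mean (5) but different spreads
narrow = [4, 5, 5, 5, 6]
wide = [1, 3, 5, 7, 9]

var_narrow = population_variance(narrow)
var_wide = population_variance(wide)

# The hand-written helper and the library function should agree
print(var_narrow, statistics.pvariance(narrow))
print(var_wide, statistics.pvariance(wide))
```

The dataset whose elements sit further from the mean ends up with the larger variance, which is the intuition described above.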
The following code snippet generates samples from the two distributions and plots them:
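import random
import matplotlib.pyplot as plt

# Same mean (0.5), different standard deviations (0.3 and 0.1)
r1 = [random.normalvariate(0.5, 0.3) for _ in range(10000)]
r2 = [random.normalvariate(0.5, 0.1) for _ in range(10000)]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(r1, bins=100)
axes[0].set_xlim([-1, 2])
axes[0].set_title("Big Variance")
axes[1].hist(r2, bins=100)
axes[1].set_title("Small Variance")
# Use the same x range on both subplots so the spreads are comparable
axes[1].set_xlim([-1, 2]);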
The results appear as follows:

Figure 2.6 – Big and small variances with the same mean at 0.5
The following code snippet generates a scatter plot that will demonstrate the difference more clearly. The variable on the x axis spreads more widely:
plt.figure(figsize=(8, 8))
plt.scatter(r1, r2, alpha=0.2)
plt.xlim(-1, 2)
plt.ylim(-1, 2)
plt.xlabel("Big Variance Variable")
plt.ylabel("Small Variance Variable")
plt.title("Variables With Different Variances");
The result looks as follows:

Figure 2.7 – Scatter plot of large-variance and small-variance variables
The spread along the x axis is significantly larger than the spread along the y axis, which reflects the difference in variance. A common mistake is to compare plots whose axis ranges differ. By default, Matplotlib chooses the axis ranges automatically from the data, so you need to set them explicitly with calls such as plt.xlim() and plt.ylim(); otherwise, the result is misleading.
Standard deviation
Standard deviation is the square root of the variance. It is used more commonly to measure the level of dispersion since it has the same unit as the original data. The formula for the standard deviation of a population reads as follows:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
Standard deviation is extremely important in scientific graphing. A standard deviation is often plotted together with the data and represents an estimate of variability.
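One detail worth checking when you compute a standard deviation in code is whether the function divides by N (population) or by N - 1 (sample). The sketch below uses statistics.pstdev() and statistics.stdev() from the standard library on illustrative toy data. numpy.std() with its default ddof=0 corresponds to the population version:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # toy data, not taken from the chapter's dataset

pop_std = statistics.pstdev(data)    # population formula, divides by N
sample_std = statistics.stdev(data)  # sample formula, divides by N - 1

# The standard deviation is the square root of the variance
assert math.isclose(pop_std, math.sqrt(statistics.pvariance(data)))

print(pop_std)
print(sample_std)
```

For the same data, the sample version is always slightly larger than the population version, and the gap shrinks as the number of data points grows.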
For this chapter, I will be using the net immigration rate for Texas from 2011 to 2018 as an example. In the following code snippet, I will first extract the county-level data, append the yearly means and standard deviations to two lists, and then plot them at the end. The standard deviation is obtained using numpy.std() and the error bar is plotted using matplotlib.pyplot.errorbar():
# df is assumed to be the DataFrame loaded earlier in the chapter
# tail(-1) drops the first row of the Texas subset
dfTX = df[df["State"] == "TX"].tail(-1)
YEARS = [year for year in range(2011, 2019)]
MEANS = []
STDS = []
for i in range(2011, 2019):
    year = "R_NET_MIG_" + str(i)
    MEANS.append(np.mean(dfTX[year]))
    STDS.append(np.std(dfTX[year]))
plt.figure(figsize=(10, 8))
plt.errorbar(YEARS, MEANS, yerr=STDS)
plt.xlabel("Year")
plt.ylabel("Net Immigration Rate");
The output appears as shown in the following figure:

Figure 2.8 – Net immigration rate across counties in Texas from 2011 to 2018
We can see in Figure 2.8 that although the net immigration in Texas is only slightly positive, the standard deviation is huge. Some counties may have a big positive net rate, while others may potentially suffer from the loss of human resources.
Quartiles
Quantiles are cut points that divide data into a number of equal portions, and quartiles are a special kind of quantile. Quartiles divide data into four equal parts, with the second quartile (the 0.5 quantile) as the median. Deciles and percentiles divide data into 10 and 100 equal parts, respectively.
The first quartile, $Q_1$, also known as the lower quartile, takes the value such that 25% of all the data lies below it. The second quartile, $Q_2$, is the median. The third quartile, $Q_3$, is also known as the upper quartile, and 25% of all values lie above it.
Quartiles are probably the most commonly used quantiles because they are associated with a statistical graph called a boxplot. Let's use the same set of Texas net immigration data to study it.
The function in NumPy is quantile(). We pass a list of the quantiles we want to calculate as an argument, as in the following single-line code snippet:
np.quantile(dfTX["R_NET_MIG_2013"],[0.25,0.5,0.75])
The output reads as follows:
array([-7.83469971,  0.87919226,  8.84040759])
The following code snippet visualizes the quartiles:
plt.figure(figsize=(12, 5))
plt.hist(dfTX["R_NET_MIG_2013"], bins=50, alpha=0.6)
for quartile in np.quantile(dfTX["R_NET_MIG_2013"], [0.25, 0.5, 0.75]):
    plt.axvline(quartile, linestyle=':', linewidth=4)
As you can see from the following output, the vertical dotted lines indicate the three quartiles:

Figure 2.9 – Quartiles of the net immigration data in 2013
The lower and upper quartiles keep 50% of the data values in between. The difference between the upper and lower quartiles, $Q_3 - Q_1$, is referred to as the Interquartile Range (IQR), and it plays an important role in outlier detection. We will see more about this soon.
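To make the 50% statement concrete, here is a minimal sketch that uses only the standard library. It draws synthetic normal data (an assumption for illustration, not the Texas dataset), computes the three quartiles with statistics.quantiles(), and counts the share of points between the lower and upper quartiles. The method="inclusive" option interpolates linearly between sorted data points, which, to my understanding, corresponds to the default behavior of numpy.quantile():

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the run is repeatable
data = [random.normalvariate(0, 1) for _ in range(10000)]

# n=4 returns the three cut points that split the data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Share of points lying between the lower and upper quartiles
inside = sum(q1 <= v <= q3 for v in data) / len(data)

print(q1, q2, q3)
print(iqr, inside)
```

The second cut point coincides with the median, and the share of points inside the interquartile range comes out at about one half.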
Skewness
Skewness differs from the three measures of variability we discussed in the previous subsections. It measures the direction the data takes and the extent to which the data distribution tilts. Skewness is given as shown in the following equation:
$$\text{skewness} = \frac{\mu - \text{mode}}{\sigma}$$
Various definitions of skewness
The skewness we defined earlier is, more precisely, Pearson's first skewness coefficient. It is defined through the mode, but there are other definitions of skewness. For example, skewness can be defined through the median.
Skewness is unitless. If the mean is larger than the mode, skewness is positive, and we say the data is skewed to the right. If the mean is smaller than the mode, skewness is negative, and the data is skewed to the left.
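As an illustration of a median-based definition, the sketch below computes Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation. The function name pearson_second_skewness, the use of the population standard deviation, and the synthetic exponential data are assumptions made for this example:

```python
import random
import statistics

def pearson_second_skewness(values):
    """3 * (mean - median) / population standard deviation."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.pstdev(values)

random.seed(1)  # arbitrary seed so the run is repeatable
right_tailed = [random.expovariate(1.0) for _ in range(10000)]
left_tailed = [-v for v in right_tailed]  # mirror image of the same data

skew_right = pearson_second_skewness(right_tailed)
skew_left = pearson_second_skewness(left_tailed)

print(skew_right)  # positive: the long tail points to the right
print(skew_left)   # negative: the long tail points to the left
```

Mirroring the data flips the sign of the coefficient but not its magnitude, which matches the left and right interpretation above.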
Here is the code snippet that generates two sets of skewed data and plots them:
r1 = [random.normalvariate(0.5, 0.4) for _ in range(10000)]
r2 = [random.normalvariate(0.1, 0.2) for _ in range(10000)]
r3 = [random.normalvariate(1.1, 0.2) for _ in range(10000)]
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# r1+r2 concatenates the two lists, producing a mixture of both samples
axes[0].hist(r1+r2, bins=100, alpha=0.5)
axes[0].axvline(np.mean(r1+r2), linestyle=':', linewidth=4)
axes[0].set_title("Skewed To Right")
axes[1].hist(r1+r3, bins=100, alpha=0.5)
axes[1].axvline(np.mean(r1+r3), linestyle=':', linewidth=4)
axes[1].set_title("Skewed to Left");
The vertical dotted line indicates the position of the mean as follows:

Figure 2.10 – Skewness demonstration
Think about the problem of income inequality. Let's say you have a histogram of the population by wealth, where the x axis value indicates the amount of wealth and the y axis value indicates the portion of the population that falls into a certain wealth range. A larger x value means more wealth. A larger y value means a greater percentage of the population falls into that range of wealth. Positive skewness (the left subplot in Figure 2.10) means that even though the average income looks good, it may be driven up by a very small number of super-rich individuals while the majority of people earn a relatively small income. Negative skewness (the right subplot in Figure 2.10) indicates that the majority may have an income above the mean value, while the mean is pulled down by a small number of very poor people who may need help.
A revisit of outlier detection
Now, let's use what we have learned to revisit the outlier detection problem.
The z-score, also known as the standard score, is a good criterion for detecting outliers. It measures the distance between an entry and the population mean, taking the population variance into consideration:
$$z = \frac{x - \mu}{\sigma}$$
If the underlying distribution is normal, a z-score greater than 3 or less than -3 has a probability of only roughly 0.27%. Even if the underlying distribution is not normal, Chebyshev's theorem guarantees that at most 1/k^2 of the total population can fall more than k standard deviations away from the mean, where k is any number greater than 1.
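The two claims above can be checked numerically. The following minimal sketch, written with the standard library only, computes z-scores for synthetic normal data and compares the observed share of entries with |z| > 3 against the Chebyshev bound of 1/k^2. The helper name z_scores is an assumption introduced for illustration:

```python
import random
import statistics

def z_scores(values):
    """Distance of each value from the mean, in units of standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

random.seed(2020)
sample = [random.normalvariate(1, 0.5) for _ in range(10000)]

k = 3
outside = sum(abs(z) > k for z in z_scores(sample)) / len(sample)
chebyshev_bound = 1 / k**2

print(outside)          # for normal data, expected to be close to 0.0027
print(chebyshev_bound)  # upper bound that holds for any distribution
```

For normal data, the observed share is far below the Chebyshev bound, because the bound has to hold for every possible distribution and is therefore much looser.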
As an example, the following code snippet generates 10,000 data points that follow a normal distribution:
random.seed(2020)
x = [random.normalvariate(1, 0.5) for _ in range(10000)]
plt.figure(figsize=(10, 8))
plt.hist(x, bins=100, alpha=0.5)
styles = [":", "--", "-."]
for i in range(3):
    plt.axvline(np.mean(x) + (i+1)*np.std(x),
                linestyle=styles[i],
                linewidth=4)
    plt.axvline(np.mean(x) - (i+1)*np.std(x),
                linestyle=styles[i],
                linewidth=4)
plt.title("Integer Z values for symmetric distributions");
In the generated histogram plot, the dotted lines indicate the locations where $z = \pm 1$. The dashed lines indicate the locations of $z = \pm 2$. The dash-dotted lines indicate the locations of $z = \pm 3$:

Figure 2.11 – Integer z value boundaries for normally distributed symmetric data
If we change the data points, the distribution will change, but the z-score criteria will remain valid. As you can see in the following code snippet, an asymmetric distribution is generated rather than a normal distribution:
x = [random.normalvariate(1, 0.5) + random.expovariate(2) for _ in range(10000)]
This produces the following output:

Figure 2.12 – Integer z value boundaries for asymmetric data
Note on the influence of extreme outliers
A drawback of the z-score is that the mean itself is also influenced by extreme outliers. The median can replace the mean to reduce this effect. The criteria are flexible, and you can set different thresholds for different production cases.
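A small sketch with made-up numbers shows how differently the two statistics react to a single extreme entry. The values are purely illustrative:

```python
import statistics

values = [1, 2, 3, 4, 5]
with_outlier = values + [1000]  # one extreme entry

# How far each statistic moves when the outlier is added
mean_shift = statistics.mean(with_outlier) - statistics.mean(values)
median_shift = statistics.median(with_outlier) - statistics.median(values)

print(mean_shift)    # the mean moves by more than 100
print(median_shift)  # the median moves by only 0.5
```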
We have covered several of the most important statistics for describing the variability of a dataset. In the next section, let's work on the data types of features.