Generating summary statistics for continuous and discrete features
Getting a feel for the distribution of continuous or discrete features is a little more complicated than it is for categorical features. A continuous feature can take an infinite number of values. An example of a continuous feature is weight, as someone can weigh 70 kilograms, or 70.1, or 70.01. Discrete features have a finite number of values, such as the number of birds sighted, or the number of apples purchased. One way of thinking about the difference is that a discrete feature is typically something that has been counted, while a continuous feature is usually captured by measurement, weighing, or timekeeping.
Continuous features will generally be stored as floating-point numbers unless they have been constrained to be whole numbers. In that case, they may be stored as integers. Age for individual humans, for example, is continuous but is usually truncated to an integer.
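As a quick illustration (using made-up values, not the dataset from this section), pandas will infer a floating-point dtype for a measured feature and an integer dtype for a counted one:

```python
import pandas as pd

# hypothetical example: weight is measured (continuous), so pandas stores it
# as a float; apples_purchased is counted (discrete), so it is stored as an integer
df = pd.DataFrame({"weight_kg": [70.0, 70.1, 70.01],
                   "apples_purchased": [3, 0, 12]})
print(df.dtypes)
```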
For most modeling purposes, continuous and discrete features are treated similarly. We would not model age as a categorical feature. We assume that the interval between ages of 25 and 26 has largely the same meaning as the interval between 35 and 36, though this breaks down at the extremes: the interval between 1 and 2 years of age for humans is not at all like the interval between 71 and 72. Data analysts and scientists are usually skeptical of assumed linear relationships between continuous features and targets, though modeling is much easier when such relationships hold.
To understand how a continuous feature (or discrete feature) is distributed, we must examine its central tendency, shape, and spread. Key summary statistics are mean and median for central tendency, skewness and kurtosis for shape, and range, interquartile range, variance, and standard deviation for spread. In this section, we will learn how to use pandas, supplemented by the SciPy library, to get these statistics. We will also discuss important implications for modeling.
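Before we turn to the data for this section, here is a minimal sketch of how each of these statistics can be computed for a single feature with standard pandas Series methods. The series here is hypothetical and is not part of the dataset we work with below:

```python
import pandas as pd

# a small, made-up feature
feat = pd.Series([2, 3, 5, 7, 11, 13, 17, 19, 23, 60])

# central tendency
print(feat.mean(), feat.median())

# shape: skewness and (excess) kurtosis
print(feat.skew(), feat.kurtosis())

# spread: range, interquartile range, variance, and standard deviation
print(feat.max() - feat.min())
print(feat.quantile(0.75) - feat.quantile(0.25))
print(feat.var(), feat.std())
```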
We will work with COVID-19 data in this section. The dataset contains one row per country, with total cases and deaths through June 2021, as well as demographic data for each country.
Note
Our World in Data provides COVID-19 public use data at https://ourworldindata.org/coronavirus-source-data. The data that will be used in this section was downloaded on July 9, 2021. There are more columns in the data than I have included. I created the region column based on country.
Follow these steps to generate the summary statistics:
- Let's load the COVID `.csv` file into pandas, set the index, and look at the data. There are 221 rows and 16 columns. The index we set, `iso_code`, contains a unique value for each row. We use `sample` to view two countries randomly, rather than the first two (we set a value for `random_state` to get the same results each time we run the code):

```
import pandas as pd
import numpy as np
import scipy.stats as scistat

covidtotals = pd.read_csv("data/covidtotals.csv",
    parse_dates=['lastdate'])
covidtotals.set_index("iso_code", inplace=True)
covidtotals.shape
(221, 16)
covidtotals.index.nunique()
221
covidtotals.sample(2, random_state=6).T
iso_code                        ISL             CZE
lastdate                 2021-07-07      2021-07-07
location                    Iceland         Czechia
total_cases                   6,555       1,668,277
total_deaths                     29          30,311
total_cases_mill             19,209         155,783
total_deaths_mill                85           2,830
population                  341,250      10,708,982
population_density                3             137
median_age                       37              43
gdp_per_capita               46,483          32,606
aged_65_older                    14              19
total_tests_thous               NaN             NaN
life_expectancy                  83              79
hospital_beds_thous               3               7
diabetes_prevalence               5               7
region               Western Europe  Western Europe
```
Just by looking at these two rows, we can see significant differences in cases and deaths between Iceland and Czechia, even after accounting for population size. (`total_cases_mill` and `total_deaths_mill` give cases and deaths per million people, respectively.) Data analysts are used to immediately wondering whether there is anything else in the data that might explain the substantially higher cases and deaths in Czechia than in Iceland. In a sense, we are always engaging in feature selection.
- Let's take a look at the data types and number of non-null values for each column. Almost all of the columns are continuous or discrete. We have data on cases and deaths, our likely targets, for 192 and 185 countries, respectively. An important data cleaning task will be figuring out what, if anything, we can do about the countries that are missing values for our targets. We'll discuss how to handle missing values later:
```
covidtotals.info()
<class 'pandas.core.frame.DataFrame'>
Index: 221 entries, AFG to ZWE
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   lastdate             221 non-null    datetime64[ns]
 1   location             221 non-null    object
 2   total_cases          192 non-null    float64
 3   total_deaths         185 non-null    float64
 4   total_cases_mill     192 non-null    float64
 5   total_deaths_mill    185 non-null    float64
 6   population           221 non-null    float64
 7   population_density   206 non-null    float64
 8   median_age           190 non-null    float64
 9   gdp_per_capita       193 non-null    float64
 10  aged_65_older        188 non-null    float64
 11  total_tests_thous    13 non-null     float64
 12  life_expectancy      217 non-null    float64
 13  hospital_beds_thous  170 non-null    float64
 14  diabetes_prevalence  200 non-null    float64
 15  region               221 non-null    object
dtypes: datetime64[ns](1), float64(13), object(2)
memory usage: 29.4+ KB
```
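If we want to see which countries are missing values for our likely targets, a quick sketch like the following can help. It assumes `covidtotals` has been loaded as in the first step; `missing_cases` is just an illustrative name, and the output is not shown here:

```python
# count missing values in each column
print(covidtotals.isnull().sum())

# list countries with no value for cases per million
missing_cases = covidtotals.loc[covidtotals.total_cases_mill.isnull(), "location"]
print(missing_cases)
```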
- Now, we are ready to examine the distribution of some of the features. We can get most of the summary statistics we want by using the `describe` method. The mean and median (50%) are good indicators of the center of the distribution, each with its strengths. It is also good to notice substantial differences between the mean and median, as an indication of skewness. For example, we can see that the mean cases per million is almost twice the median, at roughly 36.6 thousand compared to 19.5 thousand. This is a clear indicator of positive skew. The same is true for deaths per million.

The interquartile range is also quite large for cases and deaths, with the 75th percentile value being about 25 times the 25th percentile value in both cases. Compare that with the percentage of the population aged 65 and older and with diabetes prevalence, where the 75th percentile is only about four times and two times the 25th percentile, respectively. We can tell right away that those two possible features (`aged_65_older` and `diabetes_prevalence`) would have to do a lot of work to explain the huge variance in our targets:
```
keyvars = ['location','total_cases_mill','total_deaths_mill',
  'aged_65_older','diabetes_prevalence']
covidkeys = covidtotals[keyvars]
covidkeys.describe()
       total_cases_mill  total_deaths_mill  aged_65_older  diabetes_prevalence
count            192.00             185.00         188.00               200.00
mean          36,649.37             683.14           8.61                 8.44
std           41,403.98             861.73           6.12                 4.89
min                8.52               0.35           1.14                 0.99
25%            2,499.75              43.99           3.50                 5.34
50%           19,525.73             293.50           6.22                 7.20
75%           64,834.62           1,087.89          13.92                10.61
max          181,466.38           5,876.01          27.05                30.53
```
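If we want the comparisons above as explicit numbers rather than reading them off the `describe` output, a short sketch like the following works. It assumes `covidkeys` from the previous step and simply restates the ratios discussed in the text:

```python
# keep only the numeric columns of covidkeys
numcols = covidkeys.select_dtypes(include="number")

# ratio of mean to median; values well above 1 point to positive skew
print(numcols.mean() / numcols.median())

# ratio of the 75th to the 25th percentile, a rough measure of relative spread
print(numcols.quantile(0.75) / numcols.quantile(0.25))
```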
- I sometimes find it helpful to look at the decile values to get a better sense of the distribution. The `quantile` method can take a single value, such as `quantile(0.25)` for the 25th percentile, or a list or tuple, such as `quantile((0.25,0.5))` for the 25th and 50th percentiles. In the following code, we're using `arange` from NumPy (`np.arange(0.0, 1.1, 0.1)`) to generate an array that goes from 0.0 to 1.0 in increments of 0.1. We would get the same result if we were to use `covidkeys.quantile([0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0])`:

```
covidkeys.quantile(np.arange(0.0, 1.1, 0.1))
      total_cases_mill  total_deaths_mill  aged_65_older  diabetes_prevalence
0.00              8.52               0.35           1.14                 0.99
0.10            682.13              10.68           2.80                 3.30
0.20          1,717.39              30.22           3.16                 4.79
0.30          3,241.84              66.27           3.86                 5.74
0.40          9,403.58             145.06           4.69                 6.70
0.50         19,525.73             293.50           6.22                 7.20
0.60         33,636.47             556.43           7.93                 8.32
0.70         55,801.33             949.71          11.19                10.08
0.80         74,017.81           1,333.79          14.92                11.62
0.90         94,072.18           1,868.89          18.85                13.75
1.00        181,466.38           5,876.01          27.05                30.53
```
For cases, deaths, and diabetes prevalence, much of the range (the distance between the min and max values) is in the last 10% of the distribution. This is particularly true for deaths. This hints at possible modeling problems and invites us to take a close look at outliers, something we will do in the next section.
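One way to quantify how much of the range falls in the top decile is to compute the share of each feature's range that lies above its 90th percentile. This is a small additional sketch, not part of the original steps; it assumes `covidkeys` from earlier, and `top_decile_share` is just an illustrative name:

```python
# share of each feature's range that lies above its 90th percentile
numcols = covidkeys.select_dtypes(include="number")
top_decile_share = (numcols.max() - numcols.quantile(0.9)) / \
    (numcols.max() - numcols.min())
print(top_decile_share)
```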
- Some machine learning algorithms assume that our features have normal (also referred to as Gaussian) distributions, that they are distributed symmetrically (have low skew), and that they have relatively normal tails (neither excessively high nor excessively low kurtosis). The statistics we have seen so far already suggest a high positive skew for our two likely targets, total cases and deaths per million people. Let's put a finer point on this by calculating both skew and kurtosis for some of the features. For a Gaussian distribution, we expect a value near 0 for skew and, since the pandas `kurtosis` method reports excess kurtosis (kurtosis minus 3), a value near 0 for kurtosis as well. All four features are positively skewed. `total_deaths_mill` and `diabetes_prevalence` also have high excess kurtosis (heavy tails), while `aged_65_older` has slightly negative excess kurtosis (somewhat thinner tails than a normal distribution):

```
covidkeys.skew()
total_cases_mill      1.21
total_deaths_mill     2.00
aged_65_older         0.84
diabetes_prevalence   1.52
dtype: float64

covidkeys.kurtosis()
total_cases_mill      0.91
total_deaths_mill     6.58
aged_65_older        -0.56
diabetes_prevalence   3.31
dtype: float64
```
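To make the reference point concrete, the following sketch (not part of the original steps) simulates a large sample from a normal distribution; the skew and excess kurtosis that pandas reports for such a sample should both be close to 0:

```python
import numpy as np
import pandas as pd

# simulate a large sample from a normal distribution; the skew and the
# excess kurtosis reported by pandas should both be close to 0
rng = np.random.default_rng(0)
normalsample = pd.Series(rng.normal(size=100_000))
print(normalsample.skew(), normalsample.kurtosis())
```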
- We can also explicitly test each distribution for normality by looping over the features in the `keyvars` list and running a Shapiro-Wilk test on each (`scistat.shapiro(covidkeys[var].dropna())`). Notice that we need to drop missing values with `dropna` for the test to run. p-values less than 0.05 indicate that we can reject the null hypothesis of normality, which is the case for each of the four features:

```
for var in keyvars[1:]:
    stat, p = scistat.shapiro(covidkeys[var].dropna())
    print("feature=", var, " p-value=", '{:.6f}'.format(p))

feature= total_cases_mill  p-value= 0.000000
feature= total_deaths_mill  p-value= 0.000000
feature= aged_65_older  p-value= 0.000000
feature= diabetes_prevalence  p-value= 0.000000
```
These results should give us pause if we are considering parametric models such as linear regression. None of the distributions approximates a normal distribution. However, this is not determinative. It is not as simple as deciding that we should use certain models when we have normally distributed features and non-parametric models (say, k-nearest neighbors) when we do not.
We want to do additional data cleaning before we make any modeling decisions. For example, we may decide to remove outliers or determine that it is appropriate to transform the data. We will explore transformations, such as log and polynomial transformations, in several chapters in this book.
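As a preview, a log transformation is often a reasonable first thing to try for heavily right-skewed features such as `total_cases_mill`. The following is a sketch only, assuming `covidkeys` and the imports from the earlier steps (`np.log1p` is used so that the transformation is defined at zero); the results are not shown here, and we will look at transformations properly later in the book:

```python
# log-transform cases per million (log1p handles zeros gracefully)
logcases = np.log1p(covidkeys.total_cases_mill.dropna())

# compare skewness before and after the transformation
print("skew before:", covidkeys.total_cases_mill.skew())
print("skew after: ", logcases.skew())

# re-run the Shapiro-Wilk normality test on the transformed values
stat, p = scistat.shapiro(logcases)
print("p-value after log transform:", '{:.6f}'.format(p))
```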
This section showed you how to use pandas and SciPy to understand how continuous and discrete features are distributed, including their central tendency, shape, and spread. It makes sense to generate these statistics for any feature or target that might be included in our modeling. This also points us in the direction of more work we need to do to prepare our data for analysis. We need to identify missing values and outliers and figure out how we will handle them. We should also visualize the distribution of our continuous features. This rarely fails to yield additional insights. We will learn how to identify outliers in the next section and create visualizations in the following section.