Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Data Cleaning and Exploration with Machine Learning

You're reading from   Data Cleaning and Exploration with Machine Learning Get to grips with machine learning techniques to achieve sparkling-clean data quickly

Arrow left icon
Product type Paperback
Published in Aug 2022
Publisher Packt
ISBN-13 9781803241678
Length 542 pages
Edition 1st Edition
Arrow right icon
Author (1):
Arrow left icon
Michael Walker Michael Walker
Author Profile Icon Michael Walker
Michael Walker
Arrow right icon
View More author details
Toc

Table of Contents (23) Chapters Close

Preface 1. Section 1 – Data Cleaning and Machine Learning Algorithms
2. Chapter 1: Examining the Distribution of Features and Targets FREE CHAPTER 3. Chapter 2: Examining Bivariate and Multivariate Relationships between Features and Targets 4. Chapter 3: Identifying and Fixing Missing Values 5. Section 2 – Preprocessing, Feature Selection, and Sampling
6. Chapter 4: Encoding, Transforming, and Scaling Features 7. Chapter 5: Feature Selection 8. Chapter 6: Preparing for Model Evaluation 9. Section 3 – Modeling Continuous Targets with Supervised Learning
10. Chapter 7: Linear Regression Models 11. Chapter 8: Support Vector Regression 12. Chapter 9: K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression 13. Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
14. Chapter 10: Logistic Regression 15. Chapter 11: Decision Trees and Random Forest Classification 16. Chapter 12: K-Nearest Neighbors for Classification 17. Chapter 13: Support Vector Machine Classification 18. Chapter 14: Naïve Bayes Classification 19. Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning
20. Chapter 15: Principal Component Analysis 21. Chapter 16: K-Means and DBSCAN Clustering 22. Other Books You May Enjoy

Generating summary statistics for continuous and discrete features

Getting a feel for the distribution of continuous or discrete features is a little more complicated than it is for categorical features. A continuous feature can take an infinite number of values. An example of a continuous feature is weight, as someone can weigh 70 kilograms, or 70.1, or 70.01. Discrete features have a finite number of values, such as the number of birds sighted, or the number of apples purchased. One way of thinking about the difference is that a discrete feature is typically something that has been counted, while a continuous feature is usually captured by measurement, weighing, or timekeeping.

Continuous features will generally be stored as floating-point numbers unless they have been constrained to be whole numbers. In that case, they may be stored as integers. Age for individual humans, for example, is continuous but is usually truncated to an integer.

For most modeling purposes, continuous and discrete features are treated similarly. We would not model age as a categorical feature. We assume that the interval between ages has largely the same meaning between 25 and 26 as it has between 35 and 36, though this breaks down at the extremes. The interval between 1 and 2 years of age for humans is not at all like that between 71 and 72. Data analysts and scientists are usually skeptical of assumed linear relationships between continuous features and targets, though modeling is much easier when that is true.

To understand how a continuous feature (or discrete feature) is distributed, we must examine its central tendency, shape, and spread. Key summary statistics are mean and median for central tendency, skewness and kurtosis for shape, and range, interquartile range, variance, and standard deviation for spread. In this section, we will learn how to use pandas, supplemented by the SciPy library, to get these statistics. We will also discuss important implications for modeling.

We will work with COVID-19 data in this section. The dataset contains one row per country, with total cases and deaths through June 2021, as well as demographic data for each country.

Note

Our World in Data provides COVID-19 public use data at https://ourworldindata.org/coronavirus-source-data. The data that will be used in this section was downloaded on July 9, 2021. There are more columns in the data than I have included. I created the region column based on country.

Follow these steps to generate the summary statistics:

  1. Let's load the COVID .csv file into pandas, set the index, and look at the data. There are 221 rows and 16 columns. The index we set, iso_code, contains a unique value for each row. We use sample to view two countries randomly, rather than the first two (we set a value for random_state to get the same results each time we run the code):
    import pandas as pd
    import numpy as np
    import scipy.stats as scistat
    covidtotals = pd.read_csv("data/covidtotals.csv",
        parse_dates=['lastdate'])
    covidtotals.set_index("iso_code", inplace=True)
    covidtotals.shape
    (221, 16)
    covidtotals.index.nunique()
    221
    covidtotals.sample(2, random_state=6).T
    iso_code                         ISL               CZE
    lastdate                  2021-07-07        2021-07-07
    location                     Iceland           Czechia
    total_cases                    6,555         1,668,277
    total_deaths                      29            30,311
    total_cases_mill              19,209           155,783
    total_deaths_mill                 85             2,830
    population                   341,250        10,708,982
    population_density                 3               137
    median_age                        37                43
    gdp_per_capita                46,483            32,606
    aged_65_older                     14                19
    total_tests_thous                NaN               NaN
    life_expectancy                   83                79
    hospital_beds_thous                3                 7
    diabetes_prevalence                5                 7
    region                Western Europe    Western Europe

Just by looking at these two rows, we can see significant differences in cases and deaths between Iceland and Czechia, even in terms of population size. (total_cases_mill and total_deaths_mill divide cases and deaths per million of population, respectively.) Data analysts are very used to wondering if there is anything else in the data that may explain substantially higher cases and deaths in Czechia than in Iceland. In a sense, we are always engaging in feature selection.

  1. Let's take a look at the data types and number of non-null values for each column. Almost all of the columns are continuous or discrete. We have data on cases and deaths, as well as likely targets, for 192 and 185 rows, respectively. An important data cleaning task we'll have to do will be figuring out what, if anything, we can do about countries that have missing values for our targets. We'll discuss how to handle missing values later:
    covidtotals.info()
    <class 'pandas.core.frame.DataFrame'>
    Index: 221 entries, AFG to ZWE
    Data columns (total 16 columns):
     #   Column             Non-Null Count         Dtype 
    ---  -------            --------------  --------------
     0   lastdate             221 non-null  datetime64[ns]
     1   location             221 non-null          object
     2   total_cases          192 non-null         float64
     3   total_deaths         185 non-null         float64
     4   total_cases_mill     192 non-null         float64
     5   total_deaths_mill    185 non-null         float64
     6   population           221 non-null         float64
     7   population_density   206 non-null         float64
     8   median_age           190 non-null         float64
     9   gdp_per_capita       193 non-null         float64
     10  aged_65_older        188 non-null         float64
     11  total_tests_thous     13 non-null         float64
     12  life_expectancy      217 non-null         float64
     13  hospital_beds_thous  170 non-null         float64
     14  diabetes_prevalence  200 non-null         float64
     15  region               221 non-null          object
    dtypes: datetime64[ns](1), float64(13), object(2)
    memory usage: 29.4+ KB
  2. Now, we are ready to examine the distribution of some of the features. We can get most of the summary statistics we want by using the describe method. The mean and median (50%) are good indicators of the center of the distribution, each with its strengths. It is also good to notice substantial differences between the mean and median, as an indication of skewness. For example, we can see that the mean cases per million is almost twice the median, with 36.7 thousand compared to 19.5 thousand. This is a clear indicator of positive skew. This is also true for deaths per million.

The interquartile range is also quite large for cases and deaths, with the 75th percentile value being about 25 times larger than the 25th percentile value in both cases. We can compare that with the percentage of the population aged 65 and older and diabetes prevalence, where the 75th percentile is just four times or two times that of the 25th percentile, respectively. We can tell right away that those two possible features (aged_65_older and diabetes_prevalence) would have to do a lot of work to explain the huge variance in our targets:

 keyvars = ['location','total_cases_mill','total_deaths_mill',
...  'aged_65_older','diabetes_prevalence']
 covidkeys = covidtotals[keyvars]
 covidkeys.describe()
total_cases_mill total_deaths_mill aged_65_older diabetes_prevalence
count        192.00       185.00    188.00     200.00
mean      36,649.37       683.14      8.61       8.44
std       41,403.98       861.73      6.12       4.89
min            8.52         0.35      1.14       0.99
25%        2,499.75        43.99      3.50       5.34
50%       19,525.73       293.50      6.22       7.20
75%       64,834.62     1,087.89     13.92      10.61
max      181,466.38     5,876.01     27.05      30.53
  1. I sometimes find it helpful to look at the decile values to get a better sense of the distribution. The quantile method can take a single value for quantile, such as quantile(0.25) for the 25th percentile, or a list or tuple, such as quantile((0.25,0.5)) for the 25th and 50th percentiles. In the following code, we're using arange from NumPy (np.arange(0.0, 1.1, 0.1)) to generate an array that goes from 0.0 to 1.0 with a 0.1 increment. We would get the same result if we were to use covidkeys.quantile([0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]):
     covidkeys.quantile(np.arange(0.0, 1.1, 0.1))
          total_cases_mill  total_deaths_mill  aged_65_older  diabetes_prevalence
    0.00         8.52       0.35       1.14     0.99
    0.10       682.13      10.68       2.80     3.30
    0.20     1,717.39      30.22       3.16     4.79
    0.30     3,241.84      66.27       3.86     5.74
    0.40     9,403.58      145.06      4.69     6.70
    0.50     19,525.73     293.50      6.22     7.20
    0.60     33,636.47     556.43      7.93     8.32
    0.70     55,801.33     949.71     11.19    10.08
    0.80     74,017.81    1,333.79    14.92    11.62
    0.90     94,072.18    1,868.89    18.85    13.75
    1.00    181,466.38    5,876.01    27.05    30.53

For cases, deaths, and diabetes prevalence, much of the range (the distance between the min and max values) is in the last 10% of the distribution. This is particularly true for deaths. This hints at possible modeling problems and invites us to take a close look at outliers, something we will do in the next section.

  1. Some machine learning algorithms assume that our features have normal (also referred to as Gaussian) distributions, that they are distributed symmetrically (have low skew), and that they have relatively normal tails (neither excessively high nor excessively low kurtosis). The statistics we have seen so far already suggest a high positive skew for our two likely targets – that is, total cases and deaths per million people in the population. Let's put a finer point on this by calculating both skew and kurtosis for some of the features. For a Gaussian distribution, we expect a value near 0 for skew and 3 for kurtosis. total_deaths_mill has values for skew and kurtosis that are worth noting, and the total_cases_mill and aged_65_older features have excessively low kurtosis (skinny tails):
    covidkeys.skew()
    total_cases_mill        1.21
    total_deaths_mill       2.00
    aged_65_older           0.84
    diabetes_prevalence     1.52
    dtype: float64
     covidkeys.kurtosis()
    total_cases_mill        0.91
    total_deaths_mill       6.58
    aged_65_older          -0.56
    diabetes_prevalence     3.31
    dtype: float64
  2. We can also explicitly test each distribution's normality by looping over the features in the keyvars list and running a Shapiro-Wilk test on the distribution (scistat.shapiro(covidkeys[var].dropna())). Notice that we need to drop missing values with dropna for the test to run. p-values less than 0.05 indicate that we can reject the null hypothesis of normal, which is the case for each of the four features:
    for var in keyvars[1:]:
          stat, p = scistat.shapiro(covidkeys[var].dropna())
          print("feature=", var, "     p-value=", '{:.6f}'.format(p))
     
    feature= total_cases_mill       p-value= 0.000000
    feature= total_deaths_mill      p-value= 0.000000
    feature= aged_65_older          p-value= 0.000000
    feature= diabetes_prevalence    p-value= 0.000000

These results should make us pause if we are considering parametric models such as linear regression. None of the distributions approximates a normal distribution. However, this is not determinative. It is not as simple as deciding that we should use certain models when we have normally distributed features and non-parametric models (say, k-nearest neighbors) when we do not.

We want to do additional data cleaning before we make any modeling decisions. For example, we may decide to remove outliers or determine that it is appropriate to transform the data. We will explore transformations, such as log and polynomial transformations, in several chapters in this book.

This section showed you how to use pandas and SciPy to understand how continuous and discrete features are distributed, including their central tendency, shape, and spread. It makes sense to generate these statistics for any feature or target that might be included in our modeling. This also points us in the direction of more work we need to do to prepare our data for analysis. We need to identify missing values and outliers and figure out how we will handle them. We should also visualize the distribution of our continuous features. This rarely fails to yield additional insights. We will learn how to identify outliers in the next section and create visualizations in the following section.

You have been reading a chapter from
Data Cleaning and Exploration with Machine Learning
Published in: Aug 2022
Publisher: Packt
ISBN-13: 9781803241678
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image