Data Cleaning and Exploration with Machine Learning

Chapter 1: Examining the Distribution of Features and Targets

Machine learning writing and instruction are often algorithm-focused. Sometimes, this gives the impression that all we have to do is choose the right model and that organization-changing insights will follow. But the best place to begin a machine learning project is with an understanding of how the features and targets we will use are distributed.

It is important to make room for the same kind of learning from data that has been central to our work as analysts for decades – studying the distribution of variables, identifying anomalies, and examining bivariate relationships – even as we focus more and more on the accuracy of our predictions.

We will explore tools for doing so in the first three chapters of this book, while also considering implications for model building.

In this chapter, we will use common NumPy and pandas techniques to get a better sense of the attributes of our data. We want to know how key features are distributed before we do any predictive analyses. We also want to know the central tendency, shape, and spread of the distribution of each continuous feature and have a count for each value for categorical features. We will take advantage of very handy NumPy and pandas tools for generating summary statistics, such as the mean, min, and max, as well as standard deviation.

After that, we will create visualizations of key features, including histograms and boxplots, to give us a better sense of the distribution of each feature than we can get by just looking at summary statistics. We will hint at the implications of feature distribution for data transformation, encoding and scaling, and the modeling that we will be doing in subsequent chapters with the same data.

Specifically, in this chapter, we are going to cover the following topics:

  • Subsetting data
  • Generating frequencies for categorical features
  • Generating summary statistics for continuous features
  • Identifying extreme values and outliers in univariate analysis
  • Using histograms, boxplots, and violin plots to examine the distribution of continuous features

Technical requirements

This chapter will rely heavily on the pandas, NumPy, and Matplotlib libraries, but no prior knowledge of them is required. If you have installed Python from a scientific distribution, such as Anaconda or WinPython, these libraries are probably already installed. If you need to install one of them to run the code in this chapter, you can run pip install [package name] from a terminal.

Subsetting data

Almost every statistical modeling project I have worked on has required removing some data from the analysis. Often, this is because of missing values or outliers. Sometimes, there are theoretical reasons for limiting our analysis to a subset of the data. For example, we have weather data going back to 1600, but our analysis goals only involve changes in weather since 1900. Fortunately, the subsetting tools in pandas are quite powerful and flexible. We will work with data from the United States National Longitudinal Survey (NLS) of Youth in this section.

Note

The NLS of Youth is conducted by the United States Bureau of Labor Statistics. The survey started in 1997 with a cohort of individuals born between 1980 and 1985, with annual follow-ups through 2017. For this chapter, I pulled 89 variables on grades, employment, income, and attitudes toward government from the hundreds of data items on the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository. The NLS data is available for public use at https://www.nlsinfo.org/investigator/pages/search.

Let's start subsetting the data using pandas:

  1. We will start by loading the NLS data. We also set an index:
    import pandas as pd
    import numpy as np
    nls97 = pd.read_csv("data/nls97.csv")
    nls97.set_index("personid", inplace=True)
  2. Let's select a few columns from the NLS data. The following code creates a new DataFrame that contains some demographic and employment data. A useful feature of pandas is that the new DataFrame retains the index of the old DataFrame, as shown here:
    democols = ['gender','birthyear','maritalstatus',
     'weeksworked16','wageincome','highestdegree']
    nls97demo = nls97[democols]
    nls97demo.index.name
    'personid'
  3. We can use slicing to select rows by position. nls97demo[1000:1004] selects every row, starting from the row indicated by the integer to the left of the colon (1000, in this case) up to, but not including, the row indicated by the integer to the right of the colon (1004). The row at 1000 is the 1,001st row because of zero-based indexing. Each row appears as a column in the output since we have transposed the resulting DataFrame:
    nls97demo[1000:1004].T
    personid       195884       195891         195970         195996
    gender         Male         Male           Female         Female
    birthyear      1981         1980           1982           1980
    maritalstatus  NaN          Never-married  Never-married  NaN
    weeksworked16  NaN          53             53             NaN
    wageincome     NaN          14,000         52,000         NaN
    highestdegree  4.Bachelors  2.High School  4.Bachelors    3.Associates
  4. We can also skip rows over the interval by setting a value for the step after the second colon. The default value for the step is 1. The value for the following step is 2, which means that every other row between 1000 and 1004 will be selected:
    nls97demo[1000:1004:2].T
    personid        195884       195970
    gender          Male         Female
    birthyear       1981         1982
    maritalstatus   NaN          Never-married
    weeksworked16   NaN          53
    wageincome      NaN          52,000
    highestdegree   4.Bachelors  4. Bachelors
  5. If we do not include a value to the left of the colon, row selection will start with the first row. Notice that this returns the same DataFrame as the head method does:
    nls97demo[:3].T
    personid       100061         100139          100284
    gender         Female         Male            Male
    birthyear      1980           1983            1984
    maritalstatus  Married        Married         Never-married
    weeksworked16  48             53              47
    wageincome     12,500         120,000         58,000
    highestdegree  2.High School  2. High School  0.None
    nls97demo.head(3).T
    personid       100061         100139         100284
    gender         Female         Male           Male
    birthyear      1980           1983           1984
    maritalstatus  Married        Married        Never-married
    weeksworked16  48             53             47
    wageincome     12,500         120,000        58,000
    highestdegree  2.High School  2.High School  0. None
  6. If we use a negative number, -n, to the left of the colon, the last n rows of the DataFrame will be returned. This returns the same DataFrame as the tail method does:
     nls97demo[-3:].T
    personid       999543          999698        999963
    gender         Female         Female         Female
    birthyear      1984           1983           1982
    maritalstatus  Divorced       Never-married  Married
    weeksworked16  0              0              53
    wageincome     NaN            NaN            50,000
    highestdegree  2.High School  2.High School  4. Bachelors
     nls97demo.tail(3).T
    personid       999543         999698         999963
    gender         Female         Female         Female
    birthyear      1984           1983           1982
    maritalstatus  Divorced       Never-married  Married
    weeksworked16  0              0              53
    wageincome     NaN            NaN            50,000
    highestdegree  2.High School  2.High School  4. Bachelors
  7. We can select rows by index value using the loc accessor. Recall that for the nls97demo DataFrame, the index is personid. We can pass a list of the index labels to the loc accessor, such as loc[[195884,195891,195970]], to get the rows associated with those labels. We can also pass a lower and upper bound of index labels, such as loc[195884:195970], to retrieve the indicated rows:
     nls97demo.loc[[195884,195891,195970]].T
    personid       195884       195891         195970
    gender         Male         Male           Female
    birthyear      1981         1980           1982
    maritalstatus  NaN          Never-married  Never-married
    weeksworked16  NaN          53             53
    wageincome     NaN          14,000         52,000
    highestdegree  4.Bachelors  2.High School  4.Bachelors
     nls97demo.loc[195884:195970].T
    personid       195884       195891         195970
    gender         Male         Male           Female
    birthyear      1981         1980           1982
    maritalstatus  NaN          Never-married  Never-married
    weeksworked16  NaN          53             53
    wageincome     NaN          14,000         52,000
    highestdegree  4.Bachelors  2.High School  4.Bachelors
  8. To select rows by position, rather than by index label, we can use the iloc accessor. We can pass a list of position numbers, such as iloc[[0,1,2]], to the accessor to get the rows at those positions. We can pass a range, such as iloc[0:3], to get rows between the lower and upper bound, not including the row at the upper bound. We can also use the iloc accessor to select the last n rows. iloc[-3:] selects the last three rows:
     nls97demo.iloc[[0,1,2]].T
    personid       100061         100139         100284
    gender         Female         Male           Male
    birthyear      1980           1983           1984
    maritalstatus  Married        Married        Never-married
    weeksworked16  48             53             47
    wageincome     12,500         120,000        58,000
    highestdegree  2.High School  2.High School  0. None
     nls97demo.iloc[0:3].T
    personid       100061         100139         100284
    gender         Female         Male           Male
    birthyear      1980           1983           1984
    maritalstatus  Married        Married        Never-married
    weeksworked16  48             53             47
    wageincome     12,500         120,000        58,000
    highestdegree  2.High School  2.High School  0. None
     nls97demo.iloc[-3:].T
    personid       999543         999698         999963
    gender         Female         Female         Female
    birthyear      1984           1983           1982
    maritalstatus  Divorced       Never-married  Married
    weeksworked16  0              0              53
    wageincome     NaN            NaN            50,000
    highestdegree  2.High School  2.High School  4. Bachelors

Often, we need to select rows based on a column value or the values of several columns. We can do this in pandas by using Boolean indexing. Here, we pass a vector of Boolean values (which can be a Series) to the loc accessor or the bracket operator. The Boolean vector needs to have the same index as the DataFrame.

  1. Let's try this using the nightlyhrssleep column on the NLS DataFrame. We want a Boolean Series that is True for people who sleep 6 or fewer hours a night (the 33rd percentile) and False when nightlyhrssleep is greater than 6 or is missing. sleepcheckbool = nls97.nightlyhrssleep<=lowsleepthreshold creates the Boolean Series. If we display the first few values of sleepcheckbool, we can see that we are getting the expected values. We can also confirm that the sleepcheckbool index is equal to the nls97 index:
    nls97.nightlyhrssleep.head()
    personid
    100061     6
    100139     8
    100284     7
    100292     NaN
    100583     6
    Name: nightlyhrssleep, dtype: float64
    lowsleepthreshold = nls97.nightlyhrssleep.quantile(0.33)
    lowsleepthreshold
    6.0
    sleepcheckbool = nls97.nightlyhrssleep<=lowsleepthreshold
    sleepcheckbool.head()
    personid
    100061    True
    100139    False
    100284    False
    100292    False
    100583    True
    Name: nightlyhrssleep, dtype: bool
    sleepcheckbool.index.equals(nls97.index)
    True

Since the sleepcheckbool Series has the same index as nls97, we can just pass it to the loc accessor to create a DataFrame containing people who sleep 6 hours or less a night. This is a bit of pandas magic: it handles the index alignment for us:

lowsleep = nls97.loc[sleepcheckbool]
lowsleep.shape
(3067, 88)
  1. We could have created the lowsleep subset of our data in one step, which is what we would typically do unless we need the Boolean Series for some other purpose:
    lowsleep = nls97.loc[nls97.nightlyhrssleep<=lowsleepthreshold]
    lowsleep.shape
    (3067, 88)
  2. We can pass more complex conditions to the loc accessor and evaluate the values of multiple columns. For example, we can select rows where nightlyhrssleep is less than or equal to the threshold and childathome (number of children living at home) is greater than or equal to 3:
    lowsleep3pluschildren = \
      nls97.loc[(nls97.nightlyhrssleep<=lowsleepthreshold)
        & (nls97.childathome>=3)]
    lowsleep3pluschildren.shape
    (623, 88)

Each condition in nls97.loc[(nls97.nightlyhrssleep<=lowsleepthreshold) & (nls97.childathome>=3)] is placed in parentheses. An error will be generated if the parentheses are excluded. The & operator is the equivalent of and in standard Python, meaning that both conditions have to be True for the row to be selected. We could have used | for or if we wanted to select the row if either condition was True.
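To make the roles of the parentheses and operators concrete, here is a minimal sketch using a small hypothetical DataFrame (the column names mirror the NLS data, but the values are made up):

```python
import pandas as pd

# Hypothetical stand-in for the NLS data
df = pd.DataFrame({"nightlyhrssleep": [5, 8, 6, 9],
                   "childathome": [4, 3, 1, 0]},
                  index=[101, 102, 103, 104])

# & selects rows where BOTH conditions are True
both = df.loc[(df.nightlyhrssleep <= 6) & (df.childathome >= 3)]

# | selects rows where EITHER condition is True
either = df.loc[(df.nightlyhrssleep <= 6) | (df.childathome >= 3)]

print(both.shape)    # (1, 2)
print(either.shape)  # (3, 2)
```

Only person 101 satisfies both conditions, while three of the four people satisfy at least one of them.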

  1. Finally, we can select rows and columns at the same time. The expression to the left of the comma selects rows, while the list to the right of the comma selects columns:
    lowsleep3pluschildren = \
      nls97.loc[(nls97.nightlyhrssleep<=lowsleepthreshold)
        & (nls97.childathome>=3),
        ['nightlyhrssleep','childathome']]
    lowsleep3pluschildren.shape
    (623, 2)

We used three different tools to select columns and rows from a pandas DataFrame in the last two sections: the [] bracket operator and two pandas-specific accessors, loc and iloc. This will be a little confusing if you are new to pandas, but it becomes clear which tool to use in which situation after just a few months. If you came to pandas with a fair bit of Python and NumPy experience, you will likely find the [] operator most familiar. However, the pandas documentation recommends against using the [] operator for production code. The loc accessor is used for selecting rows by Boolean indexing or by index label, while the iloc accessor is used for selecting rows by row number.
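The distinction between the three tools can be summarized in a few lines. This is a minimal sketch with made-up data; note that, unlike position-based slicing, label slicing with loc includes the upper bound:

```python
import pandas as pd

# Small illustrative DataFrame with a non-default integer index
df = pd.DataFrame({"a": [10, 20, 30]}, index=[101, 102, 103])

print(df.loc[101, "a"])      # 10 -- loc: select by index label
print(df.iloc[0, 0])         # 10 -- iloc: select by integer position
print(len(df[0:2]))          # 2  -- [] slicing: by position, like a list
print(len(df.iloc[0:2]))     # 2  -- position slice: upper bound excluded
print(len(df.loc[101:102]))  # 2  -- label slice: upper bound INCLUDED
```

The last two lines return the same rows, even though one slice ends at 2 and the other at 102, because iloc excludes its upper bound while loc includes it.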

This section was a brief primer on selecting columns and rows with pandas. Although we did not go into too much detail on this, most of what you need to know to subset data was covered, as well as everything you need to know to understand the pandas-specific material in the rest of this book. We will start putting some of that to work in the next two sections by creating frequencies and summary statistics for our features.

Generating frequencies for categorical features

Categorical features can be either nominal or ordinal. Nominal features, such as gender, species name, or country, have a limited number of possible values and are either strings or numbers without any intrinsic numerical meaning. For example, if country is represented by 1 for Afghanistan, 2 for Albania, and so on, the data is numerical but it does not make sense to perform arithmetic operations on those values.

Ordinal features also have a limited number of possible values but are different from nominal features in that the order of the values matters. A Likert scale rating (ranging from 1 for very unlikely to 5 for very likely) is an example of an ordinal feature. Nonetheless, arithmetic operations would not typically make sense because there is no uniform and meaningful distance between values.
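pandas can capture the ordering of an ordinal feature with an ordered category type. The following is a small sketch with hypothetical Likert-style values (not from the NLS data); order-aware comparisons work, while arithmetic still does not:

```python
import pandas as pd

# Hypothetical Likert-style responses
ratings = pd.Series(["3. Neutral", "1. Very unlikely",
                     "5. Very likely", "3. Neutral"])
likert_order = ["1. Very unlikely", "2. Unlikely", "3. Neutral",
                "4. Likely", "5. Very likely"]
ratings = ratings.astype(
    pd.CategoricalDtype(categories=likert_order, ordered=True))

# Order-aware comparisons respect the declared ordering...
print((ratings >= "3. Neutral").sum())   # 3
print(ratings.min())                     # 1. Very unlikely
# ...but arithmetic such as ratings.mean() would raise a TypeError
```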

Before we begin modeling, we want to have counts of all the possible values for the categorical features we may use. This is typically referred to as a one-way frequency distribution. Fortunately, pandas makes this very easy to do. We can quickly select columns from a pandas DataFrame and use the value_counts method to generate counts for each categorical value:

  1. Let's load the NLS data, create a DataFrame that contains just the first 20 columns of the data, and look at the data types:
    nls97 = pd.read_csv("data/nls97.csv")
    nls97.set_index("personid", inplace=True)
    nls97abb = nls97.iloc[:,:20]
    nls97abb.dtypes
    gender                   object
    birthmonth               int64
    birthyear                int64
    highestgradecompleted    float64
    maritalstatus            object
    childathome              float64
    childnotathome           float64
    wageincome               float64
    weeklyhrscomputer        object
    weeklyhrstv              object
    nightlyhrssleep          float64
    satverbal                float64
    satmath                  float64
    gpaoverall               float64
    gpaenglish               float64
    gpamath                  float64
    gpascience               float64
    highestdegree            object
    govprovidejobs           object
    govpricecontrols         object
    dtype: object

    Note

    Recall from the previous section how column and row selection works with the loc and iloc accessors. The colon to the left of the comma indicates that we want all the rows, while :20 to the right of the comma gets us the first 20 columns.

  2. All of the object type columns in the preceding code are categorical. We can use value_counts to see the counts for each value for maritalstatus. We can also use dropna=False to get value_counts to show the missing values (NaN):
    nls97abb.maritalstatus.value_counts(dropna=False)
    Married          3066
    Never-married    2766
    NaN              2312
    Divorced         663
    Separated        154
    Widowed          23
    Name: maritalstatus, dtype: int64
  3. If we just want the number of missing values, we can chain the isnull and sum methods. isnull returns a Boolean Series containing True values when maritalstatus is missing and False otherwise. sum then counts the number of True values, since it will interpret True values as 1 and False values as 0:
    nls97abb.maritalstatus.isnull().sum()
    2312
  4. You have probably noticed that the maritalstatus values were sorted by frequency by default. You can sort them alphabetically by values by sorting the index. We can do this by taking advantage of the fact that value_counts returns a Series with the values as the index:
    marstatcnt = nls97abb.maritalstatus.value_counts(dropna=False)
    type(marstatcnt)
    <class 'pandas.core.series.Series'>
    marstatcnt.index
    Index(['Married', 'Never-married', nan, 'Divorced', 'Separated', 'Widowed'], dtype='object')
  5. To sort the index, we just need to call sort_index:
    marstatcnt.sort_index()
    Divorced         663
    Married          3066
    Never-married    2766
    Separated        154
    Widowed          23
    NaN              2312
    Name: maritalstatus, dtype: int64
  6. Of course, we could have gotten the same results in one step with nls97.maritalstatus.value_counts(dropna=False).sort_index(). We can also show ratios instead of counts by setting normalize to True. In the following code, we can see that 34% of the responses were Married (notice that we set dropna to False, so missing values are included):
    nls97.maritalstatus.\
      value_counts(normalize=True, dropna=False).\
         sort_index()
     
    Divorced             0.07
    Married              0.34
    Never-married        0.31
    Separated            0.02
    Widowed              0.00
    NaN                  0.26
    Name: maritalstatus, dtype: float64
  7. pandas has a category data type that can store data much more efficiently than the object data type when a column has a limited number of values. Since we already know that all of our object columns contain categorical data, we should convert those columns into the category data type. In the following code, we're capturing the names of the object columns in catcols. Then, we're looping through those columns and using astype to change the data type to category:
    catcols = nls97abb.select_dtypes(include=["object"]).columns
    for col in catcols:
    ...      nls97abb[col] = nls97abb[col].astype('category')
    ... 
    nls97abb[catcols].dtypes
    gender                   category
    maritalstatus            category
    weeklyhrscomputer        category
    weeklyhrstv              category
    highestdegree            category
    govprovidejobs           category
    govpricecontrols         category
    dtype: object
  8. Let's check our category features for missing values. There are no missing values for gender and very few for highestdegree. But the overwhelming majority of values for govprovidejobs (the government should provide jobs) and govpricecontrols (the government should control prices) are missing. This means that those features probably won't be useful for most modeling:
    nls97abb[catcols].isnull().sum()
    gender               0
    maritalstatus        2312
    weeklyhrscomputer    2274
    weeklyhrstv          2273
    highestdegree        31
    govprovidejobs       7151
    govpricecontrols     7125
    dtype: int64
  9. We can generate frequencies for multiple features at once by passing a value_counts call to apply. We can use filter to select the columns that we want – in this case, all the columns with gov in their name. Note that the missing values for each feature have been omitted since we did not set dropna to False:
     nls97abb.filter(like="gov").apply(pd.value_counts, normalize=True)
                     govprovidejobs    govpricecontrols
    1. Definitely              0.25                0.54
    2. Probably                0.34                0.33
    3. Probably not            0.25                0.09
    4. Definitely not          0.16                0.04
  10. We can use the same frequencies on a subset of our data. If, for example, we want to see the responses of only married people to the government role questions, we can do that subsetting by placing nls97abb.loc[nls97abb.maritalstatus=="Married"] before filter:
     nls97abb.loc[nls97abb.maritalstatus=="Married"].\
     filter(like="gov").\
       apply(pd.value_counts, normalize=True)
                     govprovidejobs    govpricecontrols
    1. Definitely              0.17                0.46
    2. Probably                0.33                0.38
    3. Probably not            0.31                0.11
    4. Definitely not          0.18                0.05
  11. Since, in this case, there were only two gov columns, it may have been easier to do the following:
     nls97abb.loc[nls97abb.maritalstatus=="Married",
       ['govprovidejobs','govpricecontrols']].\
       apply(pd.value_counts, normalize=True)
                      govprovidejobs     govpricecontrols
    1. Definitely               0.17                 0.46
    2. Probably                 0.33                 0.38
    3. Probably not             0.31                 0.11
    4. Definitely not           0.18                 0.05

Nonetheless, it will often be easier to use filter since it is not unusual to have to do the same cleaning or exploration task on groups of features with similar names.

There are times when we may want to model a continuous or discrete feature as categorical. The NLS DataFrame contains highestgradecompleted. A year increase from 5 to 6 may not be as important as that from 11 to 12 in terms of its impact on a target. Let's create a dichotomous feature instead – that is, 1 when the person has completed 12 or more grades, 0 if they have completed less than that, and missing when highestgradecompleted is missing.

  1. We need to do a little bit of cleaning up first, though. highestgradecompleted has two values that logically represent missing data – an actual NaN value that pandas recognizes as missing and a 95 value that the survey designers intend for us to also treat as missing for most use cases. Let's use replace to fix that before moving on:
    nls97abb.highestgradecompleted.\
      replace(95, np.nan, inplace=True)
  2. We can use NumPy's where function to assign values to highschoolgrad based on the values of highestgradecompleted. If highestgradecompleted is null (NaN), we assign NaN to our new column, highschoolgrad. If the value for highestgradecompleted is not null, the next clause tests for a value less than 12, setting highschoolgrad to 0 if that is true, and to 1 otherwise. We can confirm that the new column, highschoolgrad, contains the values we want by using groupby to get the min and max values of highestgradecompleted at each level of highschoolgrad:
    nls97abb['highschoolgrad'] = \
      np.where(nls97abb.highestgradecompleted.isnull(),np.nan, \
      np.where(nls97abb.highestgradecompleted<12,0,1))
     
    nls97abb.groupby(['highschoolgrad'], dropna=False) \
      ['highestgradecompleted'].agg(['min','max','size'])
                      min       max       size
    highschoolgrad                
    0                   5        11       1231
    1                  12        20       5421
    NaN               NaN       NaN       2332
     nls97abb['highschoolgrad'] = \
    ...  nls97abb['highschoolgrad'].astype('category')

While 12 makes conceptual sense as the threshold for classifying our new feature, highschoolgrad, this would present some modeling challenges if we intended to use highschoolgrad as a target. There is a pretty substantial class imbalance, with the 1 class more than four times the size of the 0 group. We should explore using more groups to represent highestgradecompleted.
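A quick way to quantify that imbalance is with value_counts. The following sketch uses a hypothetical Series built to match the group sizes shown in the preceding output (5,421 ones and 1,231 zeros):

```python
import pandas as pd

# Hypothetical target mirroring the highschoolgrad group sizes
target = pd.Series([1] * 5421 + [0] * 1231)

freqs = target.value_counts(normalize=True)
print(freqs.round(2))                        # 1: 0.81, 0: 0.19
print(round(freqs.max() / freqs.min(), 1))   # imbalance ratio: 4.4
```

An imbalance ratio well above 4 is a warning sign that a classifier trained on this target could score well simply by predicting the majority class.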

  1. One way to do this with pandas is with the qcut function. We can set the q parameter of qcut to 6 to create six groups that are as evenly distributed as possible. These groups are now closer to being balanced:
    nls97abb['highgradegroup'] = \
      pd.qcut(nls97abb['highestgradecompleted'], 
       q=6, labels=[1,2,3,4,5,6])
     
    nls97abb.groupby(['highgradegroup'])['highestgradecompleted'].\
        agg(['min','max','size'])
                      min         max      size
    highgradegroup                
    1                   5          11       1231
    2                  12          12       1389
    3                  13          14       1288
    4                  15          16       1413
    5                  17          17        388
    6                  18          20        943
    nls97abb['highgradegroup'] = \
        nls97abb['highgradegroup'].astype('category')
  2. Finally, I typically find it helpful to generate frequencies for all the categorical features and save that output so that I can refer to it later. I rerun that code whenever I make some change to the data that may change these frequencies. The following code iterates over all the columns that are of the category data type and runs value_counts:
     freqout = open('views/frequencies.txt', 'w') 
     for col in nls97abb.select_dtypes(include=["category"]):
          print(col, "----------------------",
            "frequencies",
          nls97abb[col].value_counts(dropna=False).sort_index(),
            "percentages",
          nls97abb[col].value_counts(normalize=True).\
            sort_index(),
          sep="\n\n", end="\n\n\n", file=freqout)
     
     freqout.close()

These are the key techniques for generating one-way frequencies for the categorical features in your data. The real star of the show has been the value_counts method. We can use value_counts to create frequencies a Series at a time, use it with apply for multiple columns, or iterate over several columns and call value_counts each time. We have looked at examples of each in this section. Next, let's explore some techniques for examining the distribution of continuous features.

Generating summary statistics for continuous and discrete features

Getting a feel for the distribution of continuous or discrete features is a little more complicated than it is for categorical features. A continuous feature can take an infinite number of values. An example of a continuous feature is weight, as someone can weigh 70 kilograms, or 70.1, or 70.01. Discrete features have a finite number of values, such as the number of birds sighted, or the number of apples purchased. One way of thinking about the difference is that a discrete feature is typically something that has been counted, while a continuous feature is usually captured by measurement, weighing, or timekeeping.

Continuous features will generally be stored as floating-point numbers unless they have been constrained to be whole numbers. In that case, they may be stored as integers. Age for individual humans, for example, is continuous but is usually truncated to an integer.

For most modeling purposes, continuous and discrete features are treated similarly. We would not model age as a categorical feature. We assume that the interval between ages has largely the same meaning between 25 and 26 as it has between 35 and 36, though this breaks down at the extremes. The interval between 1 and 2 years of age for humans is not at all like that between 71 and 72. Data analysts and scientists are usually skeptical of assumed linear relationships between continuous features and targets, though modeling is much easier when that is true.

To understand how a continuous feature (or discrete feature) is distributed, we must examine its central tendency, shape, and spread. Key summary statistics are mean and median for central tendency, skewness and kurtosis for shape, and range, interquartile range, variance, and standard deviation for spread. In this section, we will learn how to use pandas, supplemented by the SciPy library, to get these statistics. We will also discuss important implications for modeling.
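As a preview, here is how those statistics can be pulled from a pandas Series, with SciPy supplying the same shape measures. The values are made up for illustration; note that pandas and SciPy use slightly different default formulas for skewness and kurtosis (sample versus population corrections), so the two sets of numbers will not match exactly:

```python
import pandas as pd
import scipy.stats as scistat

# A small hypothetical continuous feature with one high value
vals = pd.Series([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 12.0])

print(vals.mean(), vals.median())               # central tendency
print(vals.skew(), vals.kurtosis())             # shape
print(vals.std(), vals.max() - vals.min())      # spread
print(vals.quantile(.75) - vals.quantile(.25))  # interquartile range
print(scistat.skew(vals), scistat.kurtosis(vals))
```

Because of the single high value, the mean sits well above the median and the skewness is positive, the kind of pattern we will look for in our real features.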

We will work with COVID-19 data in this section. The dataset contains one row per country, with total cases and deaths through June 2021, as well as demographic data for each country.

Note

Our World in Data provides COVID-19 public use data at https://ourworldindata.org/coronavirus-source-data. The data that will be used in this section was downloaded on July 9, 2021. There are more columns in the data than I have included. I created the region column based on country.

Follow these steps to generate the summary statistics:

  1. Let's load the COVID .csv file into pandas, set the index, and look at the data. There are 221 rows and 16 columns. The index we set, iso_code, contains a unique value for each row. We use sample to view two countries randomly, rather than the first two (we set a value for random_state to get the same results each time we run the code):
    import pandas as pd
    import numpy as np
    import scipy.stats as scistat
    covidtotals = pd.read_csv("data/covidtotals.csv",
        parse_dates=['lastdate'])
    covidtotals.set_index("iso_code", inplace=True)
    covidtotals.shape
    (221, 16)
    covidtotals.index.nunique()
    221
    covidtotals.sample(2, random_state=6).T
    iso_code                         ISL               CZE
    lastdate                  2021-07-07        2021-07-07
    location                     Iceland           Czechia
    total_cases                    6,555         1,668,277
    total_deaths                      29            30,311
    total_cases_mill              19,209           155,783
    total_deaths_mill                 85             2,830
    population                   341,250        10,708,982
    population_density                 3               137
    median_age                        37                43
    gdp_per_capita                46,483            32,606
    aged_65_older                     14                19
    total_tests_thous                NaN               NaN
    life_expectancy                   83                79
    hospital_beds_thous                3                 7
    diabetes_prevalence                5                 7
    region                Western Europe    Western Europe

Just by looking at these two rows, we can see significant differences in cases and deaths between Iceland and Czechia, even relative to population size. (total_cases_mill and total_deaths_mill give cases and deaths per million people, respectively.) It is natural to wonder whether anything else in the data might explain the substantially higher cases and deaths in Czechia than in Iceland. In a sense, we are always engaging in feature selection.

  2. Let's take a look at the data types and the number of non-null values for each column. Almost all of the columns are continuous or discrete. We have data for our likely targets, cases and deaths, for 192 and 185 rows, respectively. An important data cleaning task will be figuring out what, if anything, we can do about countries that have missing values for our targets. We'll discuss how to handle missing values later:
    covidtotals.info()
    <class 'pandas.core.frame.DataFrame'>
    Index: 221 entries, AFG to ZWE
    Data columns (total 16 columns):
     #   Column             Non-Null Count         Dtype 
    ---  -------            --------------  --------------
     0   lastdate             221 non-null  datetime64[ns]
     1   location             221 non-null          object
     2   total_cases          192 non-null         float64
     3   total_deaths         185 non-null         float64
     4   total_cases_mill     192 non-null         float64
     5   total_deaths_mill    185 non-null         float64
     6   population           221 non-null         float64
     7   population_density   206 non-null         float64
     8   median_age           190 non-null         float64
     9   gdp_per_capita       193 non-null         float64
     10  aged_65_older        188 non-null         float64
     11  total_tests_thous     13 non-null         float64
     12  life_expectancy      217 non-null         float64
     13  hospital_beds_thous  170 non-null         float64
     14  diabetes_prevalence  200 non-null         float64
     15  region               221 non-null          object
    dtypes: datetime64[ns](1), float64(13), object(2)
    memory usage: 29.4+ KB
  3. Now, we are ready to examine the distribution of some of the features. We can get most of the summary statistics we want by using the describe method. The mean and median (50%) are good indicators of the center of the distribution, each with its own strengths. It is also worth noticing substantial differences between the mean and median, as an indication of skewness. For example, the mean cases per million is almost twice the median, at 36.6 thousand compared with 19.5 thousand. This is a clear indicator of positive skew. The same is true for deaths per million.

The interquartile range is also quite large for cases and deaths, with the 75th percentile value being about 25 times the 25th percentile value in both cases. We can compare that with the percentage of the population aged 65 and older and with diabetes prevalence, where the 75th percentile is only about four times and two times the 25th percentile, respectively. We can tell right away that those two possible features (aged_65_older and diabetes_prevalence) would have to do a lot of work to explain the huge variance in our targets:

    keyvars = ['location','total_cases_mill','total_deaths_mill',
        'aged_65_older','diabetes_prevalence']
    covidkeys = covidtotals[keyvars]
    covidkeys.describe()
total_cases_mill total_deaths_mill aged_65_older diabetes_prevalence
count        192.00       185.00    188.00     200.00
mean      36,649.37       683.14      8.61       8.44
std       41,403.98       861.73      6.12       4.89
min            8.52         0.35      1.14       0.99
25%        2,499.75        43.99      3.50       5.34
50%       19,525.73       293.50      6.22       7.20
75%       64,834.62     1,087.89     13.92      10.61
max      181,466.38     5,876.01     27.05      30.53
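
A quick way to screen many columns for skew at once is to compare each mean with its median. The following sketch uses synthetic data rather than the COVID-19 file, since the point is just the pattern: a ratio well above 1 signals positive skew:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the COVID-19 features: one positively skewed
# column and one roughly symmetric column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "skewed": rng.exponential(scale=1000, size=500),
    "symmetric": rng.normal(loc=50, scale=5, size=500),
})

# A mean well above the median flags positive skew
ratio = df.mean() / df.median()
print(ratio)
```
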
  4. I sometimes find it helpful to look at the decile values to get a better sense of the distribution. The quantile method can take a single value, such as quantile(0.25) for the 25th percentile, or a list or tuple, such as quantile((0.25,0.5)) for the 25th and 50th percentiles. In the following code, we're using arange from NumPy (np.arange(0.0, 1.1, 0.1)) to generate an array that goes from 0.0 to 1.0 in 0.1 increments. We would get the same result if we used covidkeys.quantile([0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]):
     covidkeys.quantile(np.arange(0.0, 1.1, 0.1))
          total_cases_mill  total_deaths_mill  aged_65_older  diabetes_prevalence
    0.00         8.52       0.35       1.14     0.99
    0.10       682.13      10.68       2.80     3.30
    0.20     1,717.39      30.22       3.16     4.79
    0.30     3,241.84      66.27       3.86     5.74
    0.40     9,403.58      145.06      4.69     6.70
    0.50     19,525.73     293.50      6.22     7.20
    0.60     33,636.47     556.43      7.93     8.32
    0.70     55,801.33     949.71     11.19    10.08
    0.80     74,017.81    1,333.79    14.92    11.62
    0.90     94,072.18    1,868.89    18.85    13.75
    1.00    181,466.38    5,876.01    27.05    30.53

For cases, deaths, and diabetes prevalence, much of the range (the distance between the min and max values) is in the last 10% of the distribution. This is particularly true for deaths. This hints at possible modeling problems and invites us to take a close look at outliers, something we will do in the next section.
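
We can make the "much of the range is in the last 10%" observation precise by computing the share of the full range that sits above the 90th percentile. This sketch uses simulated heavy-tailed data rather than the actual deaths column:

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed feature standing in for total_deaths_mill
rng = np.random.default_rng(1)
deaths = pd.Series(rng.exponential(scale=1000, size=1000))

# Share of the full range (max - min) that lies above the 90th percentile
q = deaths.quantile([0.0, 0.9, 1.0])
top_decile_share = (q.loc[1.0] - q.loc[0.9]) / (q.loc[1.0] - q.loc[0.0])
print(f"{top_decile_share:.0%} of the range lies in the top decile")
```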

  5. Some machine learning algorithms assume that our features have normal (also referred to as Gaussian) distributions, that they are distributed symmetrically (have low skew), and that they have relatively normal tails (neither excessively high nor excessively low kurtosis). The statistics we have seen so far already suggest a high positive skew for our two likely targets – that is, total cases and deaths per million people in the population. Let's put a finer point on this by calculating both skew and kurtosis for some of the features. Note that pandas reports excess (Fisher) kurtosis, so for a Gaussian distribution, we expect a value near 0 for both skew and kurtosis. total_deaths_mill stands out on both measures, with a skew of 2.00 and an excess kurtosis of 6.58 (fat tails). diabetes_prevalence also has elevated kurtosis, while aged_65_older has a negative excess kurtosis (relatively skinny tails):
    covidkeys.skew()
    total_cases_mill        1.21
    total_deaths_mill       2.00
    aged_65_older           0.84
    diabetes_prevalence     1.52
    dtype: float64
     covidkeys.kurtosis()
    total_cases_mill        0.91
    total_deaths_mill       6.58
    aged_65_older          -0.56
    diabetes_prevalence     3.31
    dtype: float64
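
It is easy to mix up the two kurtosis conventions when comparing output across libraries. SciPy's kurtosis function exposes both through its fisher parameter: Fisher's (excess) definition centers a normal distribution at 0, while Pearson's centers it at 3. A quick sanity check on simulated normal draws:

```python
import numpy as np
import scipy.stats as scistat

rng = np.random.default_rng(2)
draws = rng.normal(size=100_000)

# Fisher's (excess) kurtosis: approximately 0 for a normal distribution
excess = scistat.kurtosis(draws, fisher=True)

# Pearson's kurtosis: approximately 3 for a normal distribution
pearson = scistat.kurtosis(draws, fisher=False)

print(f"excess: {excess:.3f}, pearson: {pearson:.3f}")
```
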
  6. We can also explicitly test each distribution's normality by looping over the features in the keyvars list and running a Shapiro-Wilk test on each one (scistat.shapiro(covidkeys[var].dropna())). Notice that we need to drop missing values with dropna for the test to run. A p-value less than 0.05 indicates that we can reject the null hypothesis of normality, which is the case for each of the four features:
    for var in keyvars[1:]:
          stat, p = scistat.shapiro(covidkeys[var].dropna())
          print("feature=", var, "     p-value=", '{:.6f}'.format(p))
     
    feature= total_cases_mill       p-value= 0.000000
    feature= total_deaths_mill      p-value= 0.000000
    feature= aged_65_older          p-value= 0.000000
    feature= diabetes_prevalence    p-value= 0.000000

These results should make us pause if we are considering parametric models such as linear regression. None of the distributions approximates a normal distribution. However, this is not determinative. It is not as simple as deciding that we should use certain models when we have normally distributed features and non-parametric models (say, k-nearest neighbors) when we do not.
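
Shapiro-Wilk is not the only option here. D'Agostino's K-squared test (scipy.stats.normaltest) combines skew and kurtosis into a single statistic and scales well to larger samples. This sketch contrasts a simulated normal sample with a skewed one:

```python
import numpy as np
import scipy.stats as scistat

rng = np.random.default_rng(4)
normal_draws = rng.normal(size=500)
skewed_draws = rng.exponential(size=500)

# D'Agostino's K-squared test combines skew and kurtosis into one statistic
_, p_normal = scistat.normaltest(normal_draws)
_, p_skewed = scistat.normaltest(skewed_draws)
print(f"normal sample p-value: {p_normal:.4f}")
print(f"skewed sample p-value: {p_skewed:.6f}")
```

The skewed sample is decisively rejected, while the normal sample typically is not.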

We want to do additional data cleaning before we make any modeling decisions. For example, we may decide to remove outliers or determine that it is appropriate to transform the data. We will explore transformations, such as log and polynomial transformations, in several chapters in this book.
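
As a preview of why transformations matter, a log transform often pulls a positively skewed feature much closer to symmetry. This sketch uses simulated lognormal data, not the COVID-19 figures:

```python
import numpy as np
import pandas as pd

# Simulated lognormal data: positively skewed, like our case counts
rng = np.random.default_rng(3)
cases = pd.Series(rng.lognormal(mean=9, sigma=1.2, size=500))

# Positive skew before the transform, near-zero skew after
print(f"skew before log transform: {cases.skew():.2f}")
print(f"skew after log transform:  {np.log(cases).skew():.2f}")
```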

This section showed you how to use pandas and SciPy to understand how continuous and discrete features are distributed, including their central tendency, shape, and spread. It makes sense to generate these statistics for any feature or target that might be included in our modeling. This also points us in the direction of more work we need to do to prepare our data for analysis. We need to identify missing values and outliers and figure out how we will handle them. We should also visualize the distribution of our continuous features. This rarely fails to yield additional insights. We will learn how to identify outliers in the next section and create visualizations in the following section.

Key benefits

  • Learn how to prepare data for machine learning processes
  • Understand which algorithms to use based on prediction objectives and the properties of the data
  • Explore how to interpret and evaluate the results from machine learning

Description

Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions they make and how to match the properties of the data to the algorithm for the best results. As you start with this book, models are carefully chosen to help you grasp the underlying data, including feature importance and correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You’ll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you’ll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You’ll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, as you focus more on the accuracy of predictions. By the end of this book, you’ll be able to deal with complex data problems using unsupervised ML algorithms like principal component analysis and k-means clustering.

Who is this book for?

This book is for professional data scientists, particularly those in the first few years of their career, or more experienced analysts who are relatively new to machine learning. Readers should have prior knowledge of concepts in statistics typically taught in an undergraduate introductory course as well as beginner-level experience in manipulating data programmatically.

What you will learn

  • Explore essential data cleaning and exploration techniques to be used before running the most popular machine learning algorithms
  • Understand how to perform preprocessing and feature selection, and how to set up the data for testing and validation
  • Model continuous targets with supervised learning algorithms
  • Model binary and multiclass targets with supervised learning algorithms
  • Execute clustering and dimension reduction with unsupervised learning algorithms
  • Understand how to use regression trees to model a continuous target

Product Details

Publication date : Aug 26, 2022
Length: 542 pages
Edition : 1st
Language : English
ISBN-13 : 9781803245911

Table of Contents

Section 1 – Data Cleaning and Machine Learning Algorithms
Chapter 1: Examining the Distribution of Features and Targets
Chapter 2: Examining Bivariate and Multivariate Relationships between Features and Targets
Chapter 3: Identifying and Fixing Missing Values
Section 2 – Preprocessing, Feature Selection, and Sampling
Chapter 4: Encoding, Transforming, and Scaling Features
Chapter 5: Feature Selection
Chapter 6: Preparing for Model Evaluation
Section 3 – Modeling Continuous Targets with Supervised Learning
Chapter 7: Linear Regression Models
Chapter 8: Support Vector Regression
Chapter 9: K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
Chapter 10: Logistic Regression
Chapter 11: Decision Trees and Random Forest Classification
Chapter 12: K-Nearest Neighbors for Classification
Chapter 13: Support Vector Machine Classification
Chapter 14: Naïve Bayes Classification
Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning
Chapter 15: Principal Component Analysis
Chapter 16: K-Means and DBSCAN Clustering
Other Books You May Enjoy
