Data Cleaning and Exploration with Machine Learning

Chapter 1: Examining the Distribution of Features and Targets

Machine learning writing and instruction are often algorithm-focused. Sometimes, this gives the impression that all we have to do is choose the right model and that organization-changing insights will follow. But the best place to begin a machine learning project is with an understanding of how the features and targets we will use are distributed.

It is important to make room for the same kind of learning from data that has been central to our work as analysts for decades – studying the distribution of variables, identifying anomalies, and examining bivariate relationships – even as we focus more and more on the accuracy of our predictions.

We will explore tools for doing so in the first three chapters of this book, while also considering implications for model building.

In this chapter, we will use common NumPy and pandas techniques to get a better sense of the attributes of our data. We want to know how key features are distributed before we do any predictive analyses. We also want to know the central tendency, shape, and spread of the distribution of each continuous feature and have a count for each value for categorical features. We will take advantage of very handy NumPy and pandas tools for generating summary statistics, such as the mean, min, and max, as well as standard deviation.

After that, we will create visualizations of key features, including histograms and boxplots, to give us a better sense of the distribution of each feature than we can get by just looking at summary statistics. We will hint at the implications of feature distribution for data transformation, encoding and scaling, and the modeling that we will be doing in subsequent chapters with the same data.

Specifically, in this chapter, we are going to cover the following topics:

  • Subsetting data
  • Generating frequencies for categorical features
  • Generating summary statistics for continuous features
  • Identifying extreme values and outliers in univariate analysis
  • Using histograms, boxplots, and violin plots to examine the distribution of continuous features

Technical requirements

This chapter will rely heavily on the pandas, NumPy, and Matplotlib libraries, but no prior knowledge of them is required. If you have installed Python from a scientific distribution, such as Anaconda or WinPython, these libraries are probably already installed. If you need to install one of them to run the code in this chapter, you can run pip install [package name] from a terminal.

Subsetting data

Almost every statistical modeling project I have worked on has required removing some data from the analysis. Often, this is because of missing values or outliers. Sometimes, there are theoretical reasons for limiting our analysis to a subset of the data. For example, we have weather data going back to 1600, but our analysis goals only involve changes in weather since 1900. Fortunately, the subsetting tools in pandas are quite powerful and flexible. We will work with data from the United States National Longitudinal Survey (NLS) of Youth in this section.

Note

The NLS of Youth is conducted by the United States Bureau of Labor Statistics. The survey started in 1997 with a cohort of individuals born between 1980 and 1985, with annual follow-ups through 2017. For this chapter, I pulled 89 variables on grades, employment, income, and attitudes toward government from the hundreds of data items on the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository. The NLS data is available for public use at https://www.nlsinfo.org/investigator/pages/search.

Let's start subsetting the data using pandas:

  1. We will start by loading the NLS data. We also set an index:
    import pandas as pd
    import numpy as np
    nls97 = pd.read_csv("data/nls97.csv")
    nls97.set_index("personid", inplace=True)
  2. Let's select a few columns from the NLS data. The following code creates a new DataFrame that contains some demographic and employment data. A useful feature of pandas is that the new DataFrame retains the index of the old DataFrame, as shown here:
    democols = ['gender','birthyear','maritalstatus',
     'weeksworked16','wageincome','highestdegree']
    nls97demo = nls97[democols]
    nls97demo.index.name
    'personid'
  3. We can use slicing to select rows by position. nls97demo[1000:1004] selects every row, starting from the row indicated by the integer to the left of the colon (1000, in this case) up to, but not including, the row indicated by the integer to the right of the colon (1004). The row at 1000 is the 1,001st row because of zero-based indexing. Each row appears as a column in the output since we have transposed the resulting DataFrame:
    nls97demo[1000:1004].T
    personid       195884       195891         195970         195996
    gender         Male         Male           Female         Female
    birthyear      1981         1980           1982           1980
    maritalstatus  NaN          Never-married  Never-married  NaN
    weeksworked16  NaN          53             53             NaN
    wageincome     NaN          14,000         52,000         NaN
    highestdegree  4.Bachelors  2.High School  4.Bachelors    3.Associates
  4. We can also skip rows over the interval by setting a value for the step after the second colon. The default value for the step is 1. The value for the following step is 2, which means that every other row between 1000 and 1004 will be selected:
    nls97demo[1000:1004:2].T
    personid        195884       195970
    gender          Male         Female
    birthyear       1981         1982
    maritalstatus   NaN          Never-married
    weeksworked16   NaN          53
    wageincome      NaN          52,000
    highestdegree   4.Bachelors  4. Bachelors
  5. If we do not include a value to the left of the colon, row selection will start with the first row. Notice that this returns the same DataFrame as the head method does:
    nls97demo[:3].T
    personid       100061         100139          100284
    gender         Female         Male            Male
    birthyear      1980           1983            1984
    maritalstatus  Married        Married         Never-married
    weeksworked16  48             53              47
    wageincome     12,500         120,000         58,000
    highestdegree  2.High School  2. High School  0.None
    nls97demo.head(3).T
    personid       100061         100139         100284
    gender         Female         Male           Male
    birthyear      1980           1983           1984
    maritalstatus  Married        Married        Never-married
    weeksworked16  48             53             47
    wageincome     12,500         120,000        58,000
    highestdegree  2.High School  2.High School  0. None
  6. If we use a negative number, -n, to the left of the colon, the last n rows of the DataFrame will be returned. This returns the same DataFrame as the tail method does:
     nls97demo[-3:].T
    personid       999543          999698        999963
    gender         Female         Female         Female
    birthyear      1984           1983           1982
    maritalstatus  Divorced       Never-married  Married
    weeksworked16  0              0              53
    wageincome     NaN            NaN            50,000
    highestdegree  2.High School  2.High School  4. Bachelors
     nls97demo.tail(3).T
    personid       999543         999698         999963
    gender         Female         Female         Female
    birthyear      1984           1983           1982
    maritalstatus  Divorced       Never-married  Married
    weeksworked16  0              0              53
    wageincome     NaN            NaN            50,000
    highestdegree  2.High School  2.High School  4. Bachelors
  7. We can select rows by index value using the loc accessor. Recall that for the nls97demo DataFrame, the index is personid. We can pass a list of the index labels to the loc accessor, such as loc[[195884,195891,195970]], to get the rows associated with those labels. We can also pass a lower and upper bound of index labels, such as loc[195884:195970], to retrieve the indicated rows:
     nls97demo.loc[[195884,195891,195970]].T
    personid       195884       195891         195970
    gender         Male         Male           Female
    birthyear      1981         1980           1982
    maritalstatus  NaN          Never-married  Never-married
    weeksworked16  NaN          53             53
    wageincome     NaN          14,000         52,000
    highestdegree  4.Bachelors  2.High School  4.Bachelors
     nls97demo.loc[195884:195970].T
    personid       195884       195891         195970
    gender         Male         Male           Female
    birthyear      1981         1980           1982
    maritalstatus  NaN          Never-married  Never-married
    weeksworked16  NaN          53             53
    wageincome     NaN          14,000         52,000
    highestdegree  4.Bachelors  2.High School  4.Bachelors
  8. To select rows by position, rather than by index label, we can use the iloc accessor. We can pass a list of position numbers, such as iloc[[0,1,2]], to the accessor to get the rows at those positions. We can pass a range, such as iloc[0:3], to get rows between the lower and upper bound, not including the row at the upper bound. We can also use the iloc accessor to select the last n rows. iloc[-3:] selects the last three rows:
     nls97demo.iloc[[0,1,2]].T
    personid       100061         100139         100284
    gender         Female         Male           Male
    birthyear      1980           1983           1984
    maritalstatus  Married        Married        Never-married
    weeksworked16  48             53             47
    wageincome     12,500         120,000        58,000
    highestdegree  2.High School  2.High School  0. None
     nls97demo.iloc[0:3].T
    personid       100061         100139         100284
    gender         Female         Male           Male
    birthyear      1980           1983           1984
    maritalstatus  Married        Married        Never-married
    weeksworked16  48             53             47
    wageincome     12,500         120,000        58,000
    highestdegree  2.High School  2.High School  0. None
     nls97demo.iloc[-3:].T
    personid       999543         999698         999963
    gender         Female         Female         Female
    birthyear      1984           1983           1982
    maritalstatus  Divorced       Never-married  Married
    weeksworked16  0              0              53
    wageincome     NaN            NaN            50,000
    highestdegree  2.High School  2.High School  4. Bachelors

Often, we need to select rows based on a column value or the values of several columns. We can do this in pandas by using Boolean indexing. Here, we pass a vector of Boolean values (which can be a Series) to the loc accessor or the bracket operator. The Boolean vector needs to have the same index as the DataFrame.

  1. Let's try this using the nightlyhrssleep column on the NLS DataFrame. We want a Boolean Series that is True for people who sleep 6 or fewer hours a night (the 33rd percentile) and False when nightlyhrssleep is greater than 6 or is missing. sleepcheckbool = nls97.nightlyhrssleep<=lowsleepthreshold creates the Boolean Series. If we display the first few values of sleepcheckbool, we can see that we are getting the expected values. We can also confirm that the sleepcheckbool index is equal to the nls97 index:
    nls97.nightlyhrssleep.head()
    personid
    100061     6
    100139     8
    100284     7
    100292     NaN
    100583     6
    Name: nightlyhrssleep, dtype: float64
    lowsleepthreshold = nls97.nightlyhrssleep.quantile(0.33)
    lowsleepthreshold
    6.0
    sleepcheckbool = nls97.nightlyhrssleep<=lowsleepthreshold
    sleepcheckbool.head()
    personid
    100061    True
    100139    False
    100284    False
    100292    False
    100583    True
    Name: nightlyhrssleep, dtype: bool
    sleepcheckbool.index.equals(nls97.index)
    True

Since the sleepcheckbool Series has the same index as nls97, we can just pass it to the loc accessor to create a DataFrame containing people who sleep 6 hours or less a night. This is a bit of pandas magic: it handles the index alignment for us:

lowsleep = nls97.loc[sleepcheckbool]
lowsleep.shape
(3067, 88)
  1. We could have created the lowsleep subset of our data in one step, which is what we would typically do unless we need the Boolean Series for some other purpose:
    lowsleep = nls97.loc[nls97.nightlyhrssleep<=lowsleepthreshold]
    lowsleep.shape
    (3067, 88)
  2. We can pass more complex conditions to the loc accessor and evaluate the values of multiple columns. For example, we can select rows where nightlyhrssleep is less than or equal to the threshold and childathome (number of children living at home) is greater than or equal to 3:
    lowsleep3pluschildren = \
      nls97.loc[(nls97.nightlyhrssleep<=lowsleepthreshold)
        & (nls97.childathome>=3)]
    lowsleep3pluschildren.shape
    (623, 88)

Each condition in nls97.loc[(nls97.nightlyhrssleep<=lowsleepthreshold) & (nls97.childathome>=3)] is placed in parentheses. An error will be generated if the parentheses are excluded. The & operator is the equivalent of and in standard Python, meaning that both conditions have to be True for the row to be selected. We could have used | for or if we wanted to select the row if either condition was True.
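To make the roles of the parentheses and operators concrete, here is a minimal sketch using a small hypothetical DataFrame (the column names mirror the NLS data, but the values are made up):

```python
import pandas as pd

# Hypothetical stand-in for the NLS data
df = pd.DataFrame({"nightlyhrssleep": [5, 8, 6, 9],
                   "childathome": [4, 3, 1, 0]},
                  index=[101, 102, 103, 104])

# & selects rows where BOTH conditions are True
both = df.loc[(df.nightlyhrssleep <= 6) & (df.childathome >= 3)]

# | selects rows where EITHER condition is True
either = df.loc[(df.nightlyhrssleep <= 6) | (df.childathome >= 3)]

print(both.shape)    # (1, 2)
print(either.shape)  # (3, 2)
```

Only person 101 satisfies both conditions, while three of the four people satisfy at least one of them.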

  1. Finally, we can select rows and columns at the same time. The expression to the left of the comma selects rows, while the list to the right of the comma selects columns:
    lowsleep3pluschildren = \
      nls97.loc[(nls97.nightlyhrssleep<=lowsleepthreshold)
        & (nls97.childathome>=3),
        ['nightlyhrssleep','childathome']]
    lowsleep3pluschildren.shape
    (623, 2)

We used three different tools to select columns and rows from a pandas DataFrame in the last two sections: the [] bracket operator and two pandas-specific accessors, loc and iloc. This will be a little confusing if you are new to pandas, but it becomes clear which tool to use in which situation after just a few months. If you came to pandas with a fair bit of Python and NumPy experience, you will likely find the [] operator most familiar. However, the pandas documentation recommends against using the [] operator for production code. The loc accessor is used for selecting rows by Boolean indexing or by index label, while the iloc accessor is used for selecting rows by row number.
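The distinction between the three tools can be summarized in a few lines. This is a minimal sketch with made-up data; note that, unlike position-based slicing, label slicing with loc includes the upper bound:

```python
import pandas as pd

# Small illustrative DataFrame with a non-default integer index
df = pd.DataFrame({"a": [10, 20, 30]}, index=[101, 102, 103])

print(df.loc[101, "a"])      # 10 -- loc: select by index label
print(df.iloc[0, 0])         # 10 -- iloc: select by integer position
print(len(df[0:2]))          # 2  -- [] slicing: by position, like a list
print(len(df.iloc[0:2]))     # 2  -- position slice: upper bound excluded
print(len(df.loc[101:102]))  # 2  -- label slice: upper bound INCLUDED
```

The last two lines return the same rows, even though one slice ends at 2 and the other at 102, because iloc excludes its upper bound while loc includes it.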

This section was a brief primer on selecting columns and rows with pandas. Although we did not go into too much detail on this, most of what you need to know to subset data was covered, as well as everything you need to know to understand the pandas-specific material in the rest of this book. We will start putting some of that to work in the next two sections by creating frequencies and summary statistics for our features.

Generating frequencies for categorical features

Categorical features can be either nominal or ordinal. Nominal features, such as gender, species name, or country, have a limited number of possible values and are either strings or numbers without any intrinsic numerical meaning. For example, if country is represented by 1 for Afghanistan, 2 for Albania, and so on, the data is numerical but it does not make sense to perform arithmetic operations on those values.

Ordinal features also have a limited number of possible values but are different from nominal features in that the order of the values matters. A Likert scale rating (ranging from 1 for very unlikely to 5 for very likely) is an example of an ordinal feature. Nonetheless, arithmetic operations would not typically make sense because there is no uniform and meaningful distance between values.
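pandas can capture the ordering of an ordinal feature with an ordered category type. The following is a small sketch with hypothetical Likert-style values (not from the NLS data); order-aware comparisons work, while arithmetic still does not:

```python
import pandas as pd

# Hypothetical Likert-style responses
ratings = pd.Series(["3. Neutral", "1. Very unlikely",
                     "5. Very likely", "3. Neutral"])
likert_order = ["1. Very unlikely", "2. Unlikely", "3. Neutral",
                "4. Likely", "5. Very likely"]
ratings = ratings.astype(
    pd.CategoricalDtype(categories=likert_order, ordered=True))

# Order-aware comparisons respect the declared ordering...
print((ratings >= "3. Neutral").sum())   # 3
print(ratings.min())                     # 1. Very unlikely
# ...but arithmetic such as ratings.mean() would raise a TypeError
```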

Before we begin modeling, we want to have counts of all the possible values for the categorical features we may use. This is typically referred to as a one-way frequency distribution. Fortunately, pandas makes this very easy to do. We can quickly select columns from a pandas DataFrame and use the value_counts method to generate counts for each categorical value:

  1. Let's load the NLS data, create a DataFrame that contains just the first 20 columns of the data, and look at the data types:
    nls97 = pd.read_csv("data/nls97.csv")
    nls97.set_index("personid", inplace=True)
    nls97abb = nls97.iloc[:,:20]
    nls97abb.dtypes
    gender                   object
    birthmonth               int64
    birthyear                int64
    highestgradecompleted    float64
    maritalstatus            object
    childathome              float64
    childnotathome           float64
    wageincome               float64
    weeklyhrscomputer        object
    weeklyhrstv              object
    nightlyhrssleep          float64
    satverbal                float64
    satmath                  float64
    gpaoverall               float64
    gpaenglish               float64
    gpamath                  float64
    gpascience               float64
    highestdegree            object
    govprovidejobs           object
    govpricecontrols         object
    dtype: object

    Note

    Recall from the previous section how column and row selection works with the loc and iloc accessors. The colon to the left of the comma indicates that we want all the rows, while :20 to the right of the comma gets us the first 20 columns.

  2. All of the object type columns in the preceding code are categorical. We can use value_counts to see the counts for each value for maritalstatus. We can also use dropna=False to get value_counts to show the missing values (NaN):
    nls97abb.maritalstatus.value_counts(dropna=False)
    Married          3066
    Never-married    2766
    NaN              2312
    Divorced         663
    Separated        154
    Widowed          23
    Name: maritalstatus, dtype: int64
  3. If we just want the number of missing values, we can chain the isnull and sum methods. isnull returns a Boolean Series containing True values when maritalstatus is missing and False otherwise. sum then counts the number of True values, since it will interpret True values as 1 and False values as 0:
    nls97abb.maritalstatus.isnull().sum()
    2312
  4. You have probably noticed that the maritalstatus values were sorted by frequency by default. You can sort them alphabetically by values by sorting the index. We can do this by taking advantage of the fact that value_counts returns a Series with the values as the index:
    marstatcnt = nls97abb.maritalstatus.value_counts(dropna=False)
    type(marstatcnt)
    <class 'pandas.core.series.Series'>
    marstatcnt.index
    Index(['Married', 'Never-married', nan, 'Divorced', 'Separated', 'Widowed'], dtype='object')
  5. To sort the index, we just need to call sort_index:
    marstatcnt.sort_index()
    Divorced         663
    Married          3066
    Never-married    2766
    Separated        154
    Widowed          23
    NaN              2312
    Name: maritalstatus, dtype: int64
  6. Of course, we could have gotten the same results in one step with nls97.maritalstatus.value_counts(dropna=False).sort_index(). We can also show ratios instead of counts by setting normalize to True. In the following code, we can see that 34% of the responses were Married (notice that we set dropna to False, so missing values are included):
    nls97.maritalstatus.\
      value_counts(normalize=True, dropna=False).\
         sort_index()
     
    Divorced             0.07
    Married              0.34
    Never-married        0.31
    Separated            0.02
    Widowed              0.00
    NaN                  0.26
    Name: maritalstatus, dtype: float64
  7. pandas has a category data type that can store data much more efficiently than the object data type when a column has a limited number of values. Since we already know that all of our object columns contain categorical data, we should convert those columns into the category data type. In the following code, we're capturing the names of the object columns in catcols. Then, we're looping through those columns and using astype to change the data type to category:
    catcols = nls97abb.select_dtypes(include=["object"]).columns
    for col in catcols:
    ...      nls97abb[col] = nls97abb[col].astype('category')
    ... 
    nls97abb[catcols].dtypes
    gender                   category
    maritalstatus            category
    weeklyhrscomputer        category
    weeklyhrstv              category
    highestdegree            category
    govprovidejobs           category
    govpricecontrols         category
    dtype: object
  8. Let's check our category features for missing values. There are no missing values for gender and very few for highestdegree. But the overwhelming majority of values for govprovidejobs (the government should provide jobs) and govpricecontrols (the government should control prices) are missing. This means that those features probably won't be useful for most modeling:
    nls97abb[catcols].isnull().sum()
    gender               0
    maritalstatus        2312
    weeklyhrscomputer    2274
    weeklyhrstv          2273
    highestdegree        31
    govprovidejobs       7151
    govpricecontrols     7125
    dtype: int64
  9. We can generate frequencies for multiple features at once by passing a value_counts call to apply. We can use filter to select the columns that we want – in this case, all the columns with gov in their name. Note that the missing values for each feature have been omitted since we did not set dropna to False:
     nls97abb.filter(like="gov").apply(pd.value_counts, normalize=True)
                     govprovidejobs    govpricecontrols
    1. Definitely              0.25                0.54
    2. Probably                0.34                0.33
    3. Probably not            0.25                0.09
    4. Definitely not          0.16                0.04
  10. We can use the same frequencies on a subset of our data. If, for example, we want to see the responses of only married people to the government role questions, we can do that subsetting by placing nls97abb.loc[nls97abb.maritalstatus=="Married"] before filter:
     nls97abb.loc[nls97abb.maritalstatus=="Married"].\
     filter(like="gov").\
       apply(pd.value_counts, normalize=True)
                     govprovidejobs    govpricecontrols
    1. Definitely              0.17                0.46
    2. Probably                0.33                0.38
    3. Probably not            0.31                0.11
    4. Definitely not          0.18                0.05
  11. Since, in this case, there were only two gov columns, it may have been easier to do the following:
     nls97abb.loc[nls97abb.maritalstatus=="Married",
       ['govprovidejobs','govpricecontrols']].\
       apply(pd.value_counts, normalize=True)
                      govprovidejobs     govpricecontrols
    1. Definitely               0.17                 0.46
    2. Probably                 0.33                 0.38
    3. Probably not             0.31                 0.11
    4. Definitely not           0.18                 0.05

Nonetheless, it will often be easier to use filter since it is not unusual to have to do the same cleaning or exploration task on groups of features with similar names.

There are times when we may want to model a continuous or discrete feature as categorical. The NLS DataFrame contains highestgradecompleted. A year increase from 5 to 6 may not be as important as that from 11 to 12 in terms of its impact on a target. Let's create a dichotomous feature instead – that is, 1 when the person has completed 12 or more grades, 0 if they have completed less than that, and missing when highestgradecompleted is missing.

  1. We need to do a little bit of cleaning up first, though. highestgradecompleted has two values that logically represent missing data – an actual NaN value that pandas recognizes as missing and a 95 value that the survey designers intend for us to also treat as missing for most use cases. Let's use replace to fix that before moving on:
    nls97abb.highestgradecompleted.\
      replace(95, np.nan, inplace=True)
  2. We can use NumPy's where function to assign values to highschoolgrad based on the values of highestgradecompleted. If highestgradecompleted is null (NaN), we assign NaN to our new column, highschoolgrad. If the value for highestgradecompleted is not null, the next clause tests for a value less than 12, setting highschoolgrad to 0 if that is true, and to 1 otherwise. We can confirm that the new column, highschoolgrad, contains the values we want by using groupby to get the min and max values of highestgradecompleted at each level of highschoolgrad:
    nls97abb['highschoolgrad'] = \
      np.where(nls97abb.highestgradecompleted.isnull(),np.nan, \
      np.where(nls97abb.highestgradecompleted<12,0,1))
     
    nls97abb.groupby(['highschoolgrad'], dropna=False) \
      ['highestgradecompleted'].agg(['min','max','size'])
                      min       max       size
    highschoolgrad                
    0                   5        11       1231
    1                  12        20       5421
    NaN               NaN       NaN       2332
     nls97abb['highschoolgrad'] = \
    ...  nls97abb['highschoolgrad'].astype('category')

While 12 makes conceptual sense as the threshold for classifying our new feature, highschoolgrad, this would present some modeling challenges if we intended to use highschoolgrad as a target. There is a pretty substantial class imbalance, with the 1 class more than four times the size of the 0 group. We should explore using more groups to represent highestgradecompleted.
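A quick way to quantify that imbalance is with value_counts. The following sketch uses a hypothetical Series built to match the group sizes shown in the preceding output (5,421 ones and 1,231 zeros):

```python
import pandas as pd

# Hypothetical target mirroring the highschoolgrad group sizes
target = pd.Series([1] * 5421 + [0] * 1231)

freqs = target.value_counts(normalize=True)
print(freqs.round(2))                        # 1: 0.81, 0: 0.19
print(round(freqs.max() / freqs.min(), 1))   # imbalance ratio: 4.4
```

An imbalance ratio well above 4 is a warning sign that a classifier trained on this target could score well simply by predicting the majority class.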

  1. One way to do this with pandas is with the qcut function. We can set the q parameter of qcut to 6 to create six groups that are as evenly distributed as possible. These groups are now closer to being balanced:
    nls97abb['highgradegroup'] = \
      pd.qcut(nls97abb['highestgradecompleted'], 
       q=6, labels=[1,2,3,4,5,6])
     
    nls97abb.groupby(['highgradegroup'])['highestgradecompleted'].\
        agg(['min','max','size'])
                      min         max      size
    highgradegroup                
    1                   5          11       1231
    2                  12          12       1389
    3                  13          14       1288
    4                  15          16       1413
    5                  17          17        388
    6                  18          20        943
    nls97abb['highgradegroup'] = \
        nls97abb['highgradegroup'].astype('category')
  2. Finally, I typically find it helpful to generate frequencies for all the categorical features and save that output so that I can refer to it later. I rerun that code whenever I make some change to the data that may change these frequencies. The following code iterates over all the columns that are of the category data type and runs value_counts:
     freqout = open('views/frequencies.txt', 'w') 
     for col in nls97abb.select_dtypes(include=["category"]):
          print(col, "----------------------",
            "frequencies",
          nls97abb[col].value_counts(dropna=False).sort_index(),
            "percentages",
          nls97abb[col].value_counts(normalize=True).\
            sort_index(),
          sep="\n\n", end="\n\n\n", file=freqout)
     
     freqout.close()

These are the key techniques for generating one-way frequencies for the categorical features in your data. The real star of the show has been the value_counts method. We can use value_counts to create frequencies a Series at a time, use it with apply for multiple columns, or iterate over several columns and call value_counts each time. We have looked at examples of each in this section. Next, let's explore some techniques for examining the distribution of continuous features.

Generating summary statistics for continuous and discrete features

Getting a feel for the distribution of continuous or discrete features is a little more complicated than it is for categorical features. A continuous feature can take an infinite number of values. An example of a continuous feature is weight, as someone can weigh 70 kilograms, or 70.1, or 70.01. Discrete features have a finite number of values, such as the number of birds sighted, or the number of apples purchased. One way of thinking about the difference is that a discrete feature is typically something that has been counted, while a continuous feature is usually captured by measurement, weighing, or timekeeping.

Continuous features will generally be stored as floating-point numbers unless they have been constrained to be whole numbers. In that case, they may be stored as integers. Age for individual humans, for example, is continuous but is usually truncated to an integer.

For most modeling purposes, continuous and discrete features are treated similarly. We would not model age as a categorical feature. We assume that the interval between ages has largely the same meaning between 25 and 26 as it has between 35 and 36, though this breaks down at the extremes. The interval between 1 and 2 years of age for humans is not at all like that between 71 and 72. Data analysts and scientists are usually skeptical of assumed linear relationships between continuous features and targets, though modeling is much easier when that is true.

To understand how a continuous feature (or discrete feature) is distributed, we must examine its central tendency, shape, and spread. Key summary statistics are mean and median for central tendency, skewness and kurtosis for shape, and range, interquartile range, variance, and standard deviation for spread. In this section, we will learn how to use pandas, supplemented by the SciPy library, to get these statistics. We will also discuss important implications for modeling.
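As a preview, here is how those statistics can be pulled from a pandas Series, with SciPy supplying the same shape measures. The values are made up for illustration; note that pandas and SciPy use slightly different default formulas for skewness and kurtosis (sample versus population corrections), so the two sets of numbers will not match exactly:

```python
import pandas as pd
import scipy.stats as scistat

# A small hypothetical continuous feature with one high value
vals = pd.Series([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 12.0])

print(vals.mean(), vals.median())               # central tendency
print(vals.skew(), vals.kurtosis())             # shape
print(vals.std(), vals.max() - vals.min())      # spread
print(vals.quantile(.75) - vals.quantile(.25))  # interquartile range
print(scistat.skew(vals), scistat.kurtosis(vals))
```

Because of the single high value, the mean sits well above the median and the skewness is positive, the kind of pattern we will look for in our real features.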

We will work with COVID-19 data in this section. The dataset contains one row per country, with total cases and deaths through June 2021, as well as demographic data for each country.

Note

Our World in Data provides COVID-19 public use data at https://ourworldindata.org/coronavirus-source-data. The data that will be used in this section was downloaded on July 9, 2021. There are more columns in the data than I have included. I created the region column based on country.

Follow these steps to generate the summary statistics:

  1. Let's load the COVID .csv file into pandas, set the index, and look at the data. There are 221 rows and 16 columns. The index we set, iso_code, contains a unique value for each row. We use sample to view two countries randomly, rather than the first two (we set a value for random_state to get the same results each time we run the code):
    import pandas as pd
    import numpy as np
    import scipy.stats as scistat
    covidtotals = pd.read_csv("data/covidtotals.csv",
        parse_dates=['lastdate'])
    covidtotals.set_index("iso_code", inplace=True)
    covidtotals.shape
    (221, 16)
    covidtotals.index.nunique()
    221
    covidtotals.sample(2, random_state=6).T
    iso_code                         ISL               CZE
    lastdate                  2021-07-07        2021-07-07
    location                     Iceland           Czechia
    total_cases                    6,555         1,668,277
    total_deaths                      29            30,311
    total_cases_mill              19,209           155,783
    total_deaths_mill                 85             2,830
    population                   341,250        10,708,982
    population_density                 3               137
    median_age                        37                43
    gdp_per_capita                46,483            32,606
    aged_65_older                     14                19
    total_tests_thous                NaN               NaN
    life_expectancy                   83                79
    hospital_beds_thous                3                 7
    diabetes_prevalence                5                 7
    region                Western Europe    Western Europe

Just by looking at these two rows, we can see significant differences in cases and deaths between Iceland and Czechia, even relative to population size. (total_cases_mill and total_deaths_mill give cases and deaths per million people, respectively.) It is natural to wonder whether anything else in the data might explain the substantially higher cases and deaths in Czechia than in Iceland. In a sense, we are always engaging in feature selection.

  2. Let's take a look at the data types and the number of non-null values for each column. Almost all of the columns are continuous or discrete. We have data for our likely targets, cases and deaths, for 192 and 185 rows, respectively. An important data cleaning task will be figuring out what, if anything, we can do about countries that have missing values for our targets. We'll discuss how to handle missing values later:
    covidtotals.info()
    <class 'pandas.core.frame.DataFrame'>
    Index: 221 entries, AFG to ZWE
    Data columns (total 16 columns):
     #   Column             Non-Null Count         Dtype 
    ---  -------            --------------  --------------
     0   lastdate             221 non-null  datetime64[ns]
     1   location             221 non-null          object
     2   total_cases          192 non-null         float64
     3   total_deaths         185 non-null         float64
     4   total_cases_mill     192 non-null         float64
     5   total_deaths_mill    185 non-null         float64
     6   population           221 non-null         float64
     7   population_density   206 non-null         float64
     8   median_age           190 non-null         float64
     9   gdp_per_capita       193 non-null         float64
     10  aged_65_older        188 non-null         float64
     11  total_tests_thous     13 non-null         float64
     12  life_expectancy      217 non-null         float64
     13  hospital_beds_thous  170 non-null         float64
     14  diabetes_prevalence  200 non-null         float64
     15  region               221 non-null          object
    dtypes: datetime64[ns](1), float64(13), object(2)
    memory usage: 29.4+ KB
  3. Now, we are ready to examine the distribution of some of the features. We can get most of the summary statistics we want by using the describe method. The mean and median (50%) are good indicators of the center of the distribution, each with its own strengths. It is also worth noticing substantial differences between the mean and median, as an indication of skewness. For example, the mean cases per million is almost twice the median, at 36.6 thousand compared with 19.5 thousand. This is a clear indicator of positive skew. The same is true for deaths per million.

The interquartile range is also quite large for cases and deaths, with the 75th percentile value being about 25 times the 25th percentile value in both cases. We can compare that with the percentage of the population aged 65 and older and with diabetes prevalence, where the 75th percentile is only about four times and two times the 25th percentile, respectively. We can tell right away that those two possible features (aged_65_older and diabetes_prevalence) would have to do a lot of work to explain the huge variance in our targets:

    keyvars = ['location','total_cases_mill','total_deaths_mill',
        'aged_65_older','diabetes_prevalence']
    covidkeys = covidtotals[keyvars]
    covidkeys.describe()
total_cases_mill total_deaths_mill aged_65_older diabetes_prevalence
count        192.00       185.00    188.00     200.00
mean      36,649.37       683.14      8.61       8.44
std       41,403.98       861.73      6.12       4.89
min            8.52         0.35      1.14       0.99
25%        2,499.75        43.99      3.50       5.34
50%       19,525.73       293.50      6.22       7.20
75%       64,834.62     1,087.89     13.92      10.61
max      181,466.38     5,876.01     27.05      30.53
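
A quick way to screen many columns for skew at once is to compare each mean with its median. The following sketch uses synthetic data rather than the COVID-19 file, since the point is just the pattern: a ratio well above 1 signals positive skew:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the COVID-19 features: one positively skewed
# column and one roughly symmetric column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "skewed": rng.exponential(scale=1000, size=500),
    "symmetric": rng.normal(loc=50, scale=5, size=500),
})

# A mean well above the median flags positive skew
ratio = df.mean() / df.median()
print(ratio)
```
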
  4. I sometimes find it helpful to look at the decile values to get a better sense of the distribution. The quantile method can take a single value, such as quantile(0.25) for the 25th percentile, or a list or tuple, such as quantile((0.25,0.5)) for the 25th and 50th percentiles. In the following code, we're using arange from NumPy (np.arange(0.0, 1.1, 0.1)) to generate an array that goes from 0.0 to 1.0 in 0.1 increments. We would get the same result if we used covidkeys.quantile([0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]):
     covidkeys.quantile(np.arange(0.0, 1.1, 0.1))
          total_cases_mill  total_deaths_mill  aged_65_older  diabetes_prevalence
    0.00         8.52       0.35       1.14     0.99
    0.10       682.13      10.68       2.80     3.30
    0.20     1,717.39      30.22       3.16     4.79
    0.30     3,241.84      66.27       3.86     5.74
    0.40     9,403.58      145.06      4.69     6.70
    0.50     19,525.73     293.50      6.22     7.20
    0.60     33,636.47     556.43      7.93     8.32
    0.70     55,801.33     949.71     11.19    10.08
    0.80     74,017.81    1,333.79    14.92    11.62
    0.90     94,072.18    1,868.89    18.85    13.75
    1.00    181,466.38    5,876.01    27.05    30.53

For cases, deaths, and diabetes prevalence, much of the range (the distance between the min and max values) is in the last 10% of the distribution. This is particularly true for deaths. This hints at possible modeling problems and invites us to take a close look at outliers, something we will do in the next section.
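
We can make the "much of the range is in the last 10%" observation precise by computing the share of the full range that sits above the 90th percentile. This sketch uses simulated heavy-tailed data rather than the actual deaths column:

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed feature standing in for total_deaths_mill
rng = np.random.default_rng(1)
deaths = pd.Series(rng.exponential(scale=1000, size=1000))

# Share of the full range (max - min) that lies above the 90th percentile
q = deaths.quantile([0.0, 0.9, 1.0])
top_decile_share = (q.loc[1.0] - q.loc[0.9]) / (q.loc[1.0] - q.loc[0.0])
print(f"{top_decile_share:.0%} of the range lies in the top decile")
```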

  5. Some machine learning algorithms assume that our features have normal (also referred to as Gaussian) distributions, that they are distributed symmetrically (have low skew), and that they have relatively normal tails (neither excessively high nor excessively low kurtosis). The statistics we have seen so far already suggest a high positive skew for our two likely targets – that is, total cases and deaths per million people in the population. Let's put a finer point on this by calculating both skew and kurtosis for some of the features. Note that pandas reports excess (Fisher) kurtosis, so for a Gaussian distribution, we expect a value near 0 for both skew and kurtosis. total_deaths_mill stands out on both measures, with a skew of 2.00 and an excess kurtosis of 6.58 (fat tails). diabetes_prevalence also has elevated kurtosis, while aged_65_older has a negative excess kurtosis (relatively skinny tails):
    covidkeys.skew()
    total_cases_mill        1.21
    total_deaths_mill       2.00
    aged_65_older           0.84
    diabetes_prevalence     1.52
    dtype: float64
     covidkeys.kurtosis()
    total_cases_mill        0.91
    total_deaths_mill       6.58
    aged_65_older          -0.56
    diabetes_prevalence     3.31
    dtype: float64
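
It is easy to mix up the two kurtosis conventions when comparing output across libraries. SciPy's kurtosis function exposes both through its fisher parameter: Fisher's (excess) definition centers a normal distribution at 0, while Pearson's centers it at 3. A quick sanity check on simulated normal draws:

```python
import numpy as np
import scipy.stats as scistat

rng = np.random.default_rng(2)
draws = rng.normal(size=100_000)

# Fisher's (excess) kurtosis: approximately 0 for a normal distribution
excess = scistat.kurtosis(draws, fisher=True)

# Pearson's kurtosis: approximately 3 for a normal distribution
pearson = scistat.kurtosis(draws, fisher=False)

print(f"excess: {excess:.3f}, pearson: {pearson:.3f}")
```
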
  6. We can also explicitly test each distribution's normality by looping over the features in the keyvars list and running a Shapiro-Wilk test on each one (scistat.shapiro(covidkeys[var].dropna())). Notice that we need to drop missing values with dropna for the test to run. A p-value less than 0.05 indicates that we can reject the null hypothesis of normality, which is the case for each of the four features:
    for var in keyvars[1:]:
          stat, p = scistat.shapiro(covidkeys[var].dropna())
          print("feature=", var, "     p-value=", '{:.6f}'.format(p))
     
    feature= total_cases_mill       p-value= 0.000000
    feature= total_deaths_mill      p-value= 0.000000
    feature= aged_65_older          p-value= 0.000000
    feature= diabetes_prevalence    p-value= 0.000000

These results should make us pause if we are considering parametric models such as linear regression. None of the distributions approximates a normal distribution. However, this is not determinative. It is not as simple as deciding that we should use certain models when we have normally distributed features and non-parametric models (say, k-nearest neighbors) when we do not.
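
Shapiro-Wilk is not the only option here. D'Agostino's K-squared test (scipy.stats.normaltest) combines skew and kurtosis into a single statistic and scales well to larger samples. This sketch contrasts a simulated normal sample with a skewed one:

```python
import numpy as np
import scipy.stats as scistat

rng = np.random.default_rng(4)
normal_draws = rng.normal(size=500)
skewed_draws = rng.exponential(size=500)

# D'Agostino's K-squared test combines skew and kurtosis into one statistic
_, p_normal = scistat.normaltest(normal_draws)
_, p_skewed = scistat.normaltest(skewed_draws)
print(f"normal sample p-value: {p_normal:.4f}")
print(f"skewed sample p-value: {p_skewed:.6f}")
```

The skewed sample is decisively rejected, while the normal sample typically is not.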

We want to do additional data cleaning before we make any modeling decisions. For example, we may decide to remove outliers or determine that it is appropriate to transform the data. We will explore transformations, such as log and polynomial transformations, in several chapters in this book.
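
As a preview of why transformations matter, a log transform often pulls a positively skewed feature much closer to symmetry. This sketch uses simulated lognormal data, not the COVID-19 figures:

```python
import numpy as np
import pandas as pd

# Simulated lognormal data: positively skewed, like our case counts
rng = np.random.default_rng(3)
cases = pd.Series(rng.lognormal(mean=9, sigma=1.2, size=500))

# Positive skew before the transform, near-zero skew after
print(f"skew before log transform: {cases.skew():.2f}")
print(f"skew after log transform:  {np.log(cases).skew():.2f}")
```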

This section showed you how to use pandas and SciPy to understand how continuous and discrete features are distributed, including their central tendency, shape, and spread. It makes sense to generate these statistics for any feature or target that might be included in our modeling. This also points us in the direction of more work we need to do to prepare our data for analysis. We need to identify missing values and outliers and figure out how we will handle them. We should also visualize the distribution of our continuous features. This rarely fails to yield additional insights. We will learn how to identify outliers in the next section and create visualizations in the following section.

Key benefits

  • Learn how to prepare data for machine learning processes
  • Understand which algorithms to use based on prediction objectives and the properties of the data
  • Explore how to interpret and evaluate the results from machine learning

Description

Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions they make and how to match the properties of the data to the algorithm for the best results. As you start with this book, models are carefully chosen to help you grasp the underlying data, including feature importance and correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You’ll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you’ll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You’ll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, as you focus more on the accuracy of predictions. By the end of this book, you’ll be able to deal with complex data problems using unsupervised ML algorithms like principal component analysis and k-means clustering.

Who is this book for?

This book is for professional data scientists, particularly those in the first few years of their career, or more experienced analysts who are relatively new to machine learning. Readers should have prior knowledge of concepts in statistics typically taught in an undergraduate introductory course as well as beginner-level experience in manipulating data programmatically.

What you will learn

  • Explore essential data cleaning and exploration techniques to be used before running the most popular machine learning algorithms
  • Understand how to perform preprocessing and feature selection, and how to set up the data for testing and validation
  • Model continuous targets with supervised learning algorithms
  • Model binary and multiclass targets with supervised learning algorithms
  • Execute clustering and dimension reduction with unsupervised learning algorithms
  • Understand how to use regression trees to model a continuous target

Product Details

Publication date : Aug 26, 2022
Length: 542 pages
Edition : 1st
Language : English
ISBN-13 : 9781803245911

Table of Contents

Section 1 – Data Cleaning and Machine Learning Algorithms
Chapter 1: Examining the Distribution of Features and Targets
Chapter 2: Examining Bivariate and Multivariate Relationships between Features and Targets
Chapter 3: Identifying and Fixing Missing Values
Section 2 – Preprocessing, Feature Selection, and Sampling
Chapter 4: Encoding, Transforming, and Scaling Features
Chapter 5: Feature Selection
Chapter 6: Preparing for Model Evaluation
Section 3 – Modeling Continuous Targets with Supervised Learning
Chapter 7: Linear Regression Models
Chapter 8: Support Vector Regression
Chapter 9: K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
Chapter 10: Logistic Regression
Chapter 11: Decision Trees and Random Forest Classification
Chapter 12: K-Nearest Neighbors for Classification
Chapter 13: Support Vector Machine Classification
Chapter 14: Naïve Bayes Classification
Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning
Chapter 15: Principal Component Analysis
Chapter 16: K-Means and DBSCAN Clustering
Other Books You May Enjoy
