Data Cleaning and Exploration with Machine Learning

You're reading from Data Cleaning and Exploration with Machine Learning: Get to grips with machine learning techniques to achieve sparkling-clean data quickly

Product type: Paperback
Published in: Aug 2022
Publisher: Packt
ISBN-13: 9781803241678
Length: 542 pages
Edition: 1st Edition
Author: Michael Walker
Table of Contents (23)

Preface
Section 1 – Data Cleaning and Machine Learning Algorithms
Chapter 1: Examining the Distribution of Features and Targets (free chapter)
Chapter 2: Examining Bivariate and Multivariate Relationships between Features and Targets
Chapter 3: Identifying and Fixing Missing Values
Section 2 – Preprocessing, Feature Selection, and Sampling
Chapter 4: Encoding, Transforming, and Scaling Features
Chapter 5: Feature Selection
Chapter 6: Preparing for Model Evaluation
Section 3 – Modeling Continuous Targets with Supervised Learning
Chapter 7: Linear Regression Models
Chapter 8: Support Vector Regression
Chapter 9: K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
Chapter 10: Logistic Regression
Chapter 11: Decision Trees and Random Forest Classification
Chapter 12: K-Nearest Neighbors for Classification
Chapter 13: Support Vector Machine Classification
Chapter 14: Naïve Bayes Classification
Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning
Chapter 15: Principal Component Analysis
Chapter 16: K-Means and DBSCAN Clustering
Other Books You May Enjoy

Generating frequencies for categorical features

Categorical features can be either nominal or ordinal. Nominal features, such as gender, species name, or country, have a limited number of possible values and are either strings or numbers that carry no intrinsic numerical meaning. For example, if country is represented by 1 for Afghanistan, 2 for Albania, and so on, the data is numerical, but it does not make sense to perform arithmetic operations on those values.

Ordinal features also have a limited number of possible values but are different from nominal features in that the order of the values matters. A Likert scale rating (ranging from 1 for very unlikely to 5 for very likely) is an example of an ordinal feature. Nonetheless, arithmetic operations would not typically make sense because there is no uniform and meaningful distance between values.
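In pandas, this distinction can be made explicit: an ordinal feature can be stored as an ordered categorical, so that comparisons respect the category order while arithmetic remains unsupported. Here is a minimal sketch, using a made-up Likert column rather than the NLS data:

```python
import pandas as pd

# Hypothetical Likert responses (1 = very unlikely ... 5 = very likely)
likert = pd.Series([3, 5, 1, 4, 2, 5])

# Store as an ordered categorical so comparisons respect the ordering
ratings = pd.Series(
    pd.Categorical(likert, categories=[1, 2, 3, 4, 5], ordered=True))

# Order-aware operations are meaningful...
print((ratings >= 4).sum())            # 3 responses of "likely" or higher
print(ratings.min(), ratings.max())    # 1 5

# ...but arithmetic is intentionally unsupported for categoricals
# (ratings + 1 would raise a TypeError)
```

Because the categorical is ordered, a filter such as ratings >= 4 is meaningful, which it would not be for a nominal feature like numeric country codes.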

Before we begin modeling, we want to have counts of all the possible values for the categorical features we may use. This is typically referred to as a one-way frequency distribution. Fortunately, pandas makes this very easy to do. We can quickly select columns from a pandas DataFrame and use the value_counts method to generate counts for each categorical value:

  1. Let's import pandas and NumPy (we will need NumPy later in this section), load the NLS data, create a DataFrame that contains just the first 20 columns of the data, and look at the data types:
    import pandas as pd
    import numpy as np
    nls97 = pd.read_csv("data/nls97.csv")
    nls97.set_index("personid", inplace=True)
    nls97abb = nls97.iloc[:,:20]
    nls97abb.dtypes
    gender                   object
    birthmonth               int64
    birthyear                int64
    highestgradecompleted    float64
    maritalstatus            object
    childathome              float64
    childnotathome           float64
    wageincome               float64
    weeklyhrscomputer        object
    weeklyhrstv              object
    nightlyhrssleep          float64
    satverbal                float64
    satmath                  float64
    gpaoverall               float64
    gpaenglish               float64
    gpamath                  float64
    gpascience               float64
    highestdegree            object
    govprovidejobs           object
    govpricecontrols         object
    dtype: object

    Note

    Recall from the previous section how column and row selection works with the loc and iloc accessors. The colon to the left of the comma indicates that we want all the rows, while :20 to the right of the comma gets us the first 20 columns.

  2. All of the object type columns in the preceding code are categorical. We can use value_counts to see the counts for each value for maritalstatus. We can also use dropna=False to get value_counts to show the missing values (NaN):
    nls97abb.maritalstatus.value_counts(dropna=False)
    Married          3066
    Never-married    2766
    NaN              2312
    Divorced         663
    Separated        154
    Widowed          23
    Name: maritalstatus, dtype: int64
  3. If we just want the number of missing values, we can chain the isnull and sum methods. isnull returns a Boolean Series containing True values when maritalstatus is missing and False otherwise. sum then counts the number of True values, since it will interpret True values as 1 and False values as 0:
    nls97abb.maritalstatus.isnull().sum()
    2312
  4. You have probably noticed that the maritalstatus values were sorted by frequency by default. We can sort them alphabetically by value instead by sorting the index, taking advantage of the fact that value_counts returns a Series with the values as the index:
    marstatcnt = nls97abb.maritalstatus.value_counts(dropna=False)
    type(marstatcnt)
    <class 'pandas.core.series.Series'>
    marstatcnt.index
    Index(['Married', 'Never-married', nan, 'Divorced', 'Separated', 'Widowed'], dtype='object')
  5. To sort the index, we just need to call sort_index:
    marstatcnt.sort_index()
    Divorced         663
    Married          3066
    Never-married    2766
    Separated        154
    Widowed          23
    NaN              2312
    Name: maritalstatus, dtype: int64
  6. Of course, we could have gotten the same results in one step with nls97.maritalstatus.value_counts(dropna=False).sort_index(). We can also show ratios instead of counts by setting normalize to True. In the following code, we can see that 34% of the responses were Married (notice that we set dropna to False, so missing values are included in the ratios):
    nls97.maritalstatus.\
      value_counts(normalize=True, dropna=False).\
         sort_index()
     
    Divorced             0.07
    Married              0.34
    Never-married        0.31
    Separated            0.02
    Widowed              0.00
    NaN                  0.26
    Name: maritalstatus, dtype: float64
  7. pandas has a category data type that can store data much more efficiently than the object data type when a column has a limited number of values. Since we already know that all of our object columns contain categorical data, we should convert those columns to the category data type. In the following code, we're creating a list that contains the object column names, catcols. Then, we're looping through those columns and using astype to change each one's data type to category:
    catcols = nls97abb.select_dtypes(include=["object"]).columns
    for col in catcols:
        nls97abb[col] = nls97abb[col].astype('category')
    nls97abb[catcols].dtypes
    nls97abb[catcols].dtypes
    gender                   category
    maritalstatus            category
    weeklyhrscomputer        category
    weeklyhrstv              category
    highestdegree            category
    govprovidejobs           category
    govpricecontrols         category
    dtype: object
  8. Let's check our category features for missing values. There are no missing values for gender and very few for highestdegree. But the overwhelming majority of values for govprovidejobs (the government should provide jobs) and govpricecontrols (the government should control prices) are missing. This means that those features probably won't be useful for most modeling:
    nls97abb[catcols].isnull().sum()
    gender               0
    maritalstatus        2312
    weeklyhrscomputer    2274
    weeklyhrstv          2273
    highestdegree        31
    govprovidejobs       7151
    govpricecontrols     7125
    dtype: int64
  9. We can generate frequencies for multiple features at once by passing the value_counts function to apply. We can use filter to select the columns that we want – in this case, all the columns with gov in their name. Note that the missing values for each feature have been omitted since we did not set dropna to False:
     nls97abb.filter(like="gov").apply(pd.value_counts, normalize=True)
                     govprovidejobs    govpricecontrols
    1. Definitely              0.25                0.54
    2. Probably                0.34                0.33
    3. Probably not            0.25                0.09
    4. Definitely not          0.16                0.04
  10. We can generate the same frequencies for a subset of our data. If, for example, we want to see only the responses of married people to the government role questions, we can do that subsetting by placing nls97abb.loc[nls97abb.maritalstatus=="Married"] before filter:
     nls97abb.loc[nls97abb.maritalstatus=="Married"].\
     filter(like="gov").\
       apply(pd.value_counts, normalize=True)
                     govprovidejobs    govpricecontrols
    1. Definitely              0.17                0.46
    2. Probably                0.33                0.38
    3. Probably not            0.31                0.11
    4. Definitely not          0.18                0.05
  11. Since, in this case, there were only two gov columns, it may have been easier to do the following:
     nls97abb.loc[nls97abb.maritalstatus=="Married",
       ['govprovidejobs','govpricecontrols']].\
       apply(pd.value_counts, normalize=True)
                      govprovidejobs     govpricecontrols
    1. Definitely               0.17                 0.46
    2. Probably                 0.33                 0.38
    3. Probably not             0.31                 0.11
    4. Definitely not           0.18                 0.05

Nonetheless, it will often be easier to use filter since it is not unusual to have to do the same cleaning or exploration task on groups of features with similar names.
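To illustrate, filter can match column names by substring (like), by regular expression (regex), or by an exact list (items). A small sketch on a toy DataFrame with made-up values:

```python
import pandas as pd

# Toy DataFrame; column names mimic the NLS naming patterns
df = pd.DataFrame({
    "govprovidejobs": [1, 2], "govpricecontrols": [2, 3],
    "gpaoverall": [3.1, 3.5], "gpamath": [3.0, 3.6],
})

# Substring match: every column containing "gov"
print(df.filter(like="gov").columns.tolist())
# Regex match: every column starting with "gpa"
print(df.filter(regex="^gpa").columns.tolist())
```

Either spelling returns a DataFrame restricted to the matching columns, which can then be chained with apply as in the steps above.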

There are times when we may want to model a continuous or discrete feature as categorical. The NLS DataFrame contains highestgradecompleted. A year increase from 5 to 6 may not be as important as that from 11 to 12 in terms of its impact on a target. Let's create a dichotomous feature instead – that is, 1 when the person has completed 12 or more grades, 0 if they have completed less than that, and missing when highestgradecompleted is missing.

  1. We need to do a little bit of cleaning up first, though. highestgradecompleted has two logical missing values – an actual NaN value that pandas recognizes as missing and a 95 value that the survey designers intend for us to also treat as missing for most use cases. Let's use replace to fix that before moving on, assigning the result back rather than relying on inplace so that the DataFrame column is updated reliably:
    nls97abb['highestgradecompleted'] = \
      nls97abb['highestgradecompleted'].replace(95, np.nan)
  2. We can use NumPy's where function to assign values to highschoolgrad based on the values of highestgradecompleted. If highestgradecompleted is null (NaN), we assign NaN to our new column, highschoolgrad. If the value for highestgradecompleted is not null, the next clause tests for a value less than 12, setting highschoolgrad to 0 if that is true, and to 1 otherwise. We can confirm that the new column, highschoolgrad, contains the values we want by using groupby to get the min and max values of highestgradecompleted at each level of highschoolgrad:
    nls97abb['highschoolgrad'] = \
      np.where(nls97abb.highestgradecompleted.isnull(),np.nan, \
      np.where(nls97abb.highestgradecompleted<12,0,1))
     
    nls97abb.groupby(['highschoolgrad'], dropna=False) \
      ['highestgradecompleted'].agg(['min','max','size'])
                      min       max       size
    highschoolgrad                
    0                   5        11       1231
    1                  12        20       5421
    nan               nan       nan       2332
    nls97abb['highschoolgrad'] = \
      nls97abb['highschoolgrad'].astype('category')

While 12 makes conceptual sense as the threshold for classifying our new feature, highschoolgrad, it would present some modeling challenges if we intended to use highschoolgrad as a target. There is a pretty substantial class imbalance, with the highschoolgrad equal to 1 group more than four times the size of the 0 group. We should explore using more groups to represent highestgradecompleted.
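The imbalance is easy to quantify from the group sizes reported above (1,231 versus 5,421). A quick sketch using those counts directly:

```python
import pandas as pd

# Group sizes taken from the groupby output above (missing values excluded)
sizes = pd.Series({0: 1231, 1: 5421}, name="highschoolgrad")

print((sizes / sizes.sum()).round(2))   # class shares: 0.19 vs 0.81
print(round(sizes[1] / sizes[0], 1))    # majority-to-minority ratio: 4.4
```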

  1. One way to do this with pandas is with the qcut function. We can set the q parameter of qcut to 6 to create six groups that are as evenly distributed as possible. These groups are now closer to being balanced:
    nls97abb['highgradegroup'] = \
      pd.qcut(nls97abb['highestgradecompleted'], 
       q=6, labels=[1,2,3,4,5,6])
     
    nls97abb.groupby(['highgradegroup'])['highestgradecompleted'].\
        agg(['min','max','size'])
                      min         max      size
    highgradegroup                
    1                   5          11       1231
    2                  12          12       1389
    3                  13          14       1288
    4                  15          16       1413
    5                  17          17        388
    6                  18          20        943
    nls97abb['highgradegroup'] = \
        nls97abb['highgradegroup'].astype('category')
  2. Finally, I typically find it helpful to generate frequencies for all the categorical features and save that output so that I can refer to it later. I rerun that code whenever I make some change to the data that may change these frequencies. The following code iterates over all the columns that are of the category data type and runs value_counts:
     freqout = open('views/frequencies.txt', 'w') 
     for col in nls97abb.select_dtypes(include=["category"]):
          print(col, "----------------------",
            "frequencies",
          nls97abb[col].value_counts(dropna=False).sort_index(),
            "percentages",
          nls97abb[col].value_counts(normalize=True).\
            sort_index(),
          sep="\n\n", end="\n\n\n", file=freqout)
     
     freqout.close()
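As an aside on the binning in step 1, qcut chooses the bin edges for us from quantiles; if we would rather set the grade thresholds ourselves, pandas also provides cut, which bins on explicit edges. A sketch with made-up grade values and illustrative labels (neither is from the book):

```python
import pandas as pd

# Hypothetical highest-grade values
grades = pd.Series([5, 9, 12, 12, 14, 16, 17, 20])

# Explicit, right-inclusive bins: (4,11], (11,12], (12,16], (16,20]
groups = pd.cut(grades, bins=[4, 11, 12, 16, 20],
                labels=["lt-hs", "hs", "some-college", "post-college"])

print(groups.value_counts().sort_index())
```

The trade-off is the one seen above: cut gives interpretable, hand-picked boundaries, while qcut gives groups that are as close to equal-sized as the data allows.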

These are the key techniques for generating one-way frequencies for the categorical features in your data. The real star of the show has been the value_counts method. We can use value_counts to create frequencies a Series at a time, use it with apply for multiple columns, or iterate over several columns and call value_counts each time. We have looked at examples of each in this section. Next, let's explore some techniques for examining the distribution of continuous features.
