Data Cleaning and Exploration with Machine Learning

You're reading from Data Cleaning and Exploration with Machine Learning: Get to grips with machine learning techniques to achieve sparkling-clean data quickly

Product type: Paperback
Published in: Aug 2022
Publisher: Packt
ISBN-13: 9781803241678
Length: 542 pages
Edition: 1st Edition
Author: Michael Walker
Table of Contents (23)

Preface
Section 1 – Data Cleaning and Machine Learning Algorithms
Chapter 1: Examining the Distribution of Features and Targets (free chapter)
Chapter 2: Examining Bivariate and Multivariate Relationships between Features and Targets
Chapter 3: Identifying and Fixing Missing Values
Section 2 – Preprocessing, Feature Selection, and Sampling
Chapter 4: Encoding, Transforming, and Scaling Features
Chapter 5: Feature Selection
Chapter 6: Preparing for Model Evaluation
Section 3 – Modeling Continuous Targets with Supervised Learning
Chapter 7: Linear Regression Models
Chapter 8: Support Vector Regression
Chapter 9: K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
Chapter 10: Logistic Regression
Chapter 11: Decision Trees and Random Forest Classification
Chapter 12: K-Nearest Neighbors for Classification
Chapter 13: Support Vector Machine Classification
Chapter 14: Naïve Bayes Classification
Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning
Chapter 15: Principal Component Analysis
Chapter 16: K-Means and DBSCAN Clustering
Other Books You May Enjoy

Generating frequencies for categorical features

Categorical features can be either nominal or ordinal. Nominal features, such as gender, species name, or country, have a limited number of possible values and are either strings or numbers that carry no intrinsic numerical meaning. For example, if country is represented by 1 for Afghanistan, 2 for Albania, and so on, the data is numerical, but it does not make sense to perform arithmetic operations on those values.

Ordinal features also have a limited number of possible values but are different from nominal features in that the order of the values matters. A Likert scale rating (ranging from 1 for very unlikely to 5 for very likely) is an example of an ordinal feature. Nonetheless, arithmetic operations would not typically make sense because there is no uniform and meaningful distance between values.
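In pandas, this distinction can be made explicit: an ordinal feature can be stored as an ordered categorical, so that comparisons respect the category order while arithmetic remains unsupported. Here is a minimal sketch, using a made-up Likert column rather than the NLS data:

```python
import pandas as pd

# Hypothetical Likert responses (1 = very unlikely ... 5 = very likely)
likert = pd.Series([3, 5, 1, 4, 2, 5])

# Store as an ordered categorical so comparisons respect the ordering
ratings = pd.Series(
    pd.Categorical(likert, categories=[1, 2, 3, 4, 5], ordered=True))

# Order-aware operations are meaningful...
print((ratings >= 4).sum())            # 3 responses of "likely" or higher
print(ratings.min(), ratings.max())    # 1 5

# ...but arithmetic is intentionally unsupported for categoricals
# (ratings + 1 would raise a TypeError)
```

Because the categorical is ordered, a filter such as ratings >= 4 is meaningful, which it would not be for a nominal feature like numeric country codes.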

Before we begin modeling, we want to have counts of all the possible values for the categorical features we may use. This is typically referred to as a one-way frequency distribution. Fortunately, pandas makes this very easy to do. We can quickly select columns from a pandas DataFrame and use the value_counts method to generate counts for each categorical value:

  1. Let's import pandas and NumPy (we will need NumPy later in this section), load the NLS data, create a DataFrame that contains just the first 20 columns of the data, and look at the data types:
    import pandas as pd
    import numpy as np
    nls97 = pd.read_csv("data/nls97.csv")
    nls97.set_index("personid", inplace=True)
    nls97abb = nls97.iloc[:,:20]
    nls97abb.dtypes
    gender                   object
    birthmonth               int64
    birthyear                int64
    highestgradecompleted    float64
    maritalstatus            object
    childathome              float64
    childnotathome           float64
    wageincome               float64
    weeklyhrscomputer        object
    weeklyhrstv              object
    nightlyhrssleep          float64
    satverbal                float64
    satmath                  float64
    gpaoverall               float64
    gpaenglish               float64
    gpamath                  float64
    gpascience               float64
    highestdegree            object
    govprovidejobs           object
    govpricecontrols         object
    dtype: object

    Note

    Recall from the previous section how column and row selection works with the loc and iloc accessors. The colon to the left of the comma indicates that we want all the rows, while :20 to the right of the comma gets us the first 20 columns.

  2. All of the object type columns in the preceding code are categorical. We can use value_counts to see the counts for each value for maritalstatus. We can also use dropna=False to get value_counts to show the missing values (NaN):
    nls97abb.maritalstatus.value_counts(dropna=False)
    Married          3066
    Never-married    2766
    NaN              2312
    Divorced         663
    Separated        154
    Widowed          23
    Name: maritalstatus, dtype: int64
  3. If we just want the number of missing values, we can chain the isnull and sum methods. isnull returns a Boolean Series containing True values when maritalstatus is missing and False otherwise. sum then counts the number of True values, since it will interpret True values as 1 and False values as 0:
    nls97abb.maritalstatus.isnull().sum()
    2312
  4. You have probably noticed that the maritalstatus values were sorted by frequency by default. We can sort them alphabetically by value instead by sorting the index, taking advantage of the fact that value_counts returns a Series with the values as the index:
    marstatcnt = nls97abb.maritalstatus.value_counts(dropna=False)
    type(marstatcnt)
    <class 'pandas.core.series.Series'>
    marstatcnt.index
    Index(['Married', 'Never-married', nan, 'Divorced', 'Separated', 'Widowed'], dtype='object')
  5. To sort the index, we just need to call sort_index:
    marstatcnt.sort_index()
    Divorced         663
    Married          3066
    Never-married    2766
    Separated        154
    Widowed          23
    NaN              2312
    Name: maritalstatus, dtype: int64
  6. Of course, we could have gotten the same results in one step with nls97.maritalstatus.value_counts(dropna=False).sort_index(). We can also show ratios instead of counts by setting normalize to True. In the following code, we can see that 34% of the responses were Married (notice that we set dropna to False, so missing values are included in the ratios):
    nls97.maritalstatus.\
      value_counts(normalize=True, dropna=False).\
         sort_index()
     
    Divorced             0.07
    Married              0.34
    Never-married        0.31
    Separated            0.02
    Widowed              0.00
    NaN                  0.26
    Name: maritalstatus, dtype: float64
  7. pandas has a category data type that can store data much more efficiently than the object data type when a column has a limited number of values. Since we already know that all of our object columns contain categorical data, we should convert those columns to the category data type. In the following code, we're creating a list that contains the object column names, catcols. Then, we're looping through those columns and using astype to change each one's data type to category:
    catcols = nls97abb.select_dtypes(include=["object"]).columns
    for col in catcols:
        nls97abb[col] = nls97abb[col].astype('category')
    nls97abb[catcols].dtypes
    nls97abb[catcols].dtypes
    gender                   category
    maritalstatus            category
    weeklyhrscomputer        category
    weeklyhrstv              category
    highestdegree            category
    govprovidejobs           category
    govpricecontrols         category
    dtype: object
  8. Let's check our category features for missing values. There are no missing values for gender and very few for highestdegree. But the overwhelming majority of values for govprovidejobs (the government should provide jobs) and govpricecontrols (the government should control prices) are missing. This means that those features probably won't be useful for most modeling:
    nls97abb[catcols].isnull().sum()
    gender               0
    maritalstatus        2312
    weeklyhrscomputer    2274
    weeklyhrstv          2273
    highestdegree        31
    govprovidejobs       7151
    govpricecontrols     7125
    dtype: int64
  9. We can generate frequencies for multiple features at once by passing the value_counts function to apply. We can use filter to select the columns that we want – in this case, all the columns with gov in their name. Note that the missing values for each feature have been omitted since we did not set dropna to False:
     nls97abb.filter(like="gov").apply(pd.value_counts, normalize=True)
                     govprovidejobs    govpricecontrols
    1. Definitely              0.25                0.54
    2. Probably                0.34                0.33
    3. Probably not            0.25                0.09
    4. Definitely not          0.16                0.04
  10. We can generate the same frequencies for a subset of our data. If, for example, we want to see only the responses of married people to the government role questions, we can do that subsetting by placing nls97abb.loc[nls97abb.maritalstatus=="Married"] before filter:
     nls97abb.loc[nls97abb.maritalstatus=="Married"].\
     filter(like="gov").\
       apply(pd.value_counts, normalize=True)
                     govprovidejobs    govpricecontrols
    1. Definitely              0.17                0.46
    2. Probably                0.33                0.38
    3. Probably not            0.31                0.11
    4. Definitely not          0.18                0.05
  11. Since, in this case, there were only two gov columns, it may have been easier to do the following:
     nls97abb.loc[nls97abb.maritalstatus=="Married",
       ['govprovidejobs','govpricecontrols']].\
       apply(pd.value_counts, normalize=True)
                      govprovidejobs     govpricecontrols
    1. Definitely               0.17                 0.46
    2. Probably                 0.33                 0.38
    3. Probably not             0.31                 0.11
    4. Definitely not           0.18                 0.05

Nonetheless, it will often be easier to use filter since it is not unusual to have to do the same cleaning or exploration task on groups of features with similar names.
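To illustrate, filter can match column names by substring (like), by regular expression (regex), or by an exact list (items). A small sketch on a toy DataFrame with made-up values:

```python
import pandas as pd

# Toy DataFrame; column names mimic the NLS naming patterns
df = pd.DataFrame({
    "govprovidejobs": [1, 2], "govpricecontrols": [2, 3],
    "gpaoverall": [3.1, 3.5], "gpamath": [3.0, 3.6],
})

# Substring match: every column containing "gov"
print(df.filter(like="gov").columns.tolist())
# Regex match: every column starting with "gpa"
print(df.filter(regex="^gpa").columns.tolist())
```

Either spelling returns a DataFrame restricted to the matching columns, which can then be chained with apply as in the steps above.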

There are times when we may want to model a continuous or discrete feature as categorical. The NLS DataFrame contains highestgradecompleted. A year increase from 5 to 6 may not be as important as that from 11 to 12 in terms of its impact on a target. Let's create a dichotomous feature instead – that is, 1 when the person has completed 12 or more grades, 0 if they have completed less than that, and missing when highestgradecompleted is missing.

  1. We need to do a little bit of cleaning up first, though. highestgradecompleted has two logical missing values – an actual NaN value that pandas recognizes as missing and a 95 value that the survey designers intend for us to also treat as missing for most use cases. Let's use replace to fix that before moving on, assigning the result back rather than relying on inplace so that the DataFrame column is updated reliably:
    nls97abb['highestgradecompleted'] = \
      nls97abb['highestgradecompleted'].replace(95, np.nan)
  2. We can use NumPy's where function to assign values to highschoolgrad based on the values of highestgradecompleted. If highestgradecompleted is null (NaN), we assign NaN to our new column, highschoolgrad. If the value for highestgradecompleted is not null, the next clause tests for a value less than 12, setting highschoolgrad to 0 if that is true, and to 1 otherwise. We can confirm that the new column, highschoolgrad, contains the values we want by using groupby to get the min and max values of highestgradecompleted at each level of highschoolgrad:
    nls97abb['highschoolgrad'] = \
      np.where(nls97abb.highestgradecompleted.isnull(),np.nan, \
      np.where(nls97abb.highestgradecompleted<12,0,1))
     
    nls97abb.groupby(['highschoolgrad'], dropna=False) \
      ['highestgradecompleted'].agg(['min','max','size'])
                      min       max       size
    highschoolgrad                
    0                   5        11       1231
    1                  12        20       5421
    nan               nan       nan       2332
    nls97abb['highschoolgrad'] = \
      nls97abb['highschoolgrad'].astype('category')

While 12 makes conceptual sense as the threshold for classifying our new feature, highschoolgrad, it would present some modeling challenges if we intended to use highschoolgrad as a target. There is a pretty substantial class imbalance, with the highschoolgrad equal to 1 group more than four times the size of the 0 group. We should explore using more groups to represent highestgradecompleted.
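The imbalance is easy to quantify from the group sizes reported above (1,231 versus 5,421). A quick sketch using those counts directly:

```python
import pandas as pd

# Group sizes taken from the groupby output above (missing values excluded)
sizes = pd.Series({0: 1231, 1: 5421}, name="highschoolgrad")

print((sizes / sizes.sum()).round(2))   # class shares: 0.19 vs 0.81
print(round(sizes[1] / sizes[0], 1))    # majority-to-minority ratio: 4.4
```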

  1. One way to do this with pandas is with the qcut function. We can set the q parameter of qcut to 6 to create six groups that are as evenly distributed as possible. These groups are now closer to being balanced:
    nls97abb['highgradegroup'] = \
      pd.qcut(nls97abb['highestgradecompleted'], 
       q=6, labels=[1,2,3,4,5,6])
     
    nls97abb.groupby(['highgradegroup'])['highestgradecompleted'].\
        agg(['min','max','size'])
                      min         max      size
    highgradegroup                
    1                   5          11       1231
    2                  12          12       1389
    3                  13          14       1288
    4                  15          16       1413
    5                  17          17        388
    6                  18          20        943
    nls97abb['highgradegroup'] = \
        nls97abb['highgradegroup'].astype('category')
  2. Finally, I typically find it helpful to generate frequencies for all the categorical features and save that output so that I can refer to it later. I rerun that code whenever I make some change to the data that may change these frequencies. The following code iterates over all the columns that are of the category data type and runs value_counts:
     freqout = open('views/frequencies.txt', 'w') 
     for col in nls97abb.select_dtypes(include=["category"]):
          print(col, "----------------------",
            "frequencies",
          nls97abb[col].value_counts(dropna=False).sort_index(),
            "percentages",
          nls97abb[col].value_counts(normalize=True).\
            sort_index(),
          sep="\n\n", end="\n\n\n", file=freqout)
     
     freqout.close()
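As an aside on the binning in step 1, qcut chooses the bin edges for us from quantiles; if we would rather set the grade thresholds ourselves, pandas also provides cut, which bins on explicit edges. A sketch with made-up grade values and illustrative labels (neither is from the book):

```python
import pandas as pd

# Hypothetical highest-grade values
grades = pd.Series([5, 9, 12, 12, 14, 16, 17, 20])

# Explicit, right-inclusive bins: (4,11], (11,12], (12,16], (16,20]
groups = pd.cut(grades, bins=[4, 11, 12, 16, 20],
                labels=["lt-hs", "hs", "some-college", "post-college"])

print(groups.value_counts().sort_index())
```

The trade-off is the one seen above: cut gives interpretable, hand-picked boundaries, while qcut gives groups that are as close to equal-sized as the data allows.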

These are the key techniques for generating one-way frequencies for the categorical features in your data. The real star of the show has been the value_counts method. We can use value_counts to create frequencies a Series at a time, use it with apply for multiple columns, or iterate over several columns and call value_counts each time. We have looked at examples of each in this section. Next, let's explore some techniques for examining the distribution of continuous features.
