Generating frequencies for categorical features
Categorical features can be either nominal or ordinal. Nominal features, such as gender, species name, or country, have a limited number of possible values and are either strings or numbers that carry no intrinsic numerical meaning. For example, if country is represented by 1 for Afghanistan, 2 for Albania, and so on, the data is numerical, but it does not make sense to perform arithmetic operations on those values.
Ordinal features also have a limited number of possible values but are different from nominal features in that the order of the values matters. A Likert scale rating (ranging from 1 for very unlikely to 5 for very likely) is an example of an ordinal feature. Nonetheless, arithmetic operations would not typically make sense because there is no uniform and meaningful distance between values.
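As a quick illustration of the distinction, here is a minimal sketch of how pandas can represent an unordered nominal feature and an ordered Likert-style feature. The values are made up for this example and are not taken from the NLS data:

```python
import pandas as pd

# nominal: a limited set of values with no meaningful order
country = pd.Series(["Afghanistan", "Albania", "Albania"], dtype="category")

# ordinal: the order of the categories matters, so we declare it explicitly
likert = pd.Series(
    pd.Categorical([1, 5, 3, 3], categories=[1, 2, 3, 4, 5], ordered=True))

country.cat.ordered   # False
likert.cat.ordered    # True
```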
Before we begin modeling, we want to have counts of all the possible values for the categorical features we may use. This is typically referred to as a one-way frequency distribution. Fortunately, pandas makes this very easy to do. We can quickly select columns from a pandas DataFrame and use the `value_counts` method to generate counts for each categorical value:
- Let's load the NLS data, create a DataFrame that contains just the first 20 columns of the data, and look at the data types:

```python
import pandas as pd
import numpy as np   # used later for np.nan and np.where

nls97 = pd.read_csv("data/nls97.csv")
nls97.set_index("personid", inplace=True)
nls97abb = nls97.iloc[:,:20]
nls97abb.dtypes
```

```
gender                    object
birthmonth                 int64
birthyear                  int64
highestgradecompleted    float64
maritalstatus             object
childathome              float64
childnotathome           float64
wageincome               float64
weeklyhrscomputer         object
weeklyhrstv               object
nightlyhrssleep          float64
satverbal                float64
satmath                  float64
gpaoverall               float64
gpaenglish               float64
gpamath                  float64
gpascience               float64
highestdegree             object
govprovidejobs            object
govpricecontrols          object
dtype: object
```
Note
Recall from the previous section how column and row selection works with the `loc` and `iloc` accessors. The colon to the left of the comma indicates that we want all the rows, while `:20` to the right of the comma gets us the first 20 columns.

- All of the object type columns in the preceding code are categorical. We can use `value_counts` to see the counts for each value for `maritalstatus`. We can also use `dropna=False` to get `value_counts` to show the missing values (`NaN`):

```python
nls97abb.maritalstatus.value_counts(dropna=False)
```

```
Married          3066
Never-married    2766
NaN              2312
Divorced          663
Separated         154
Widowed            23
Name: maritalstatus, dtype: int64
```
- If we just want the number of missing values, we can chain the `isnull` and `sum` methods. `isnull` returns a Boolean Series containing `True` values when `maritalstatus` is missing and `False` otherwise. `sum` then counts the number of `True` values, since it interprets `True` values as 1 and `False` values as 0:

```python
nls97abb.maritalstatus.isnull().sum()
```

```
2312
```
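As a quick check of that logic, summing any Boolean Series simply counts its `True` values:

```python
# True is treated as 1 and False as 0 when summed, so this returns 2
pd.Series([True, False, True, False]).sum()
```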
- You have probably noticed that the `maritalstatus` values were sorted by frequency by default. You can sort them alphabetically by value instead by sorting the index. We can do this by taking advantage of the fact that `value_counts` returns a Series with the values as the index:

```python
marstatcnt = nls97abb.maritalstatus.value_counts(dropna=False)
type(marstatcnt)
```

```
<class 'pandas.core.series.Series'>
```

```python
marstatcnt.index
```

```
Index(['Married', 'Never-married', nan, 'Divorced', 'Separated',
       'Widowed'], dtype='object')
```
- To sort the index, we just need to call `sort_index`:

```python
marstatcnt.sort_index()
```

```
Divorced          663
Married          3066
Never-married    2766
Separated         154
Widowed            23
NaN              2312
Name: maritalstatus, dtype: int64
```
- Of course, we could have gotten the same results in one step with `nls97.maritalstatus.value_counts(dropna=False).sort_index()`. We can also show ratios instead of counts by setting `normalize` to `True`. In the following code, we can see that 34% of the responses were `Married` (notice that we set `dropna` to `False`, so missing values have been included in the calculation):

```python
nls97.maritalstatus.\
  value_counts(normalize=True, dropna=False).\
  sort_index()
```

```
Divorced         0.07
Married          0.34
Never-married    0.31
Separated        0.02
Widowed          0.00
NaN              0.26
Name: maritalstatus, dtype: float64
```
- pandas has a category data type that can store data much more efficiently than the object data type when a column has a limited number of values. Since we already know that all of our object columns contain categorical data, we should convert those columns into the category data type. In the following code, we're creating a list that contains the column names for the object columns, `catcols`. Then, we're looping through those columns and using `astype` to change the data type to `category`:

```python
catcols = nls97abb.select_dtypes(include=["object"]).columns
for col in nls97abb[catcols].columns:
    nls97abb[col] = nls97abb[col].astype('category')

nls97abb[catcols].dtypes
```

```
gender               category
maritalstatus        category
weeklyhrscomputer    category
weeklyhrstv          category
highestdegree        category
govprovidejobs       category
govpricecontrols     category
dtype: object
```
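If you want to verify the efficiency gain on your own data, a quick optional check is to compare the memory footprint of a column before and after conversion; the exact savings depend on how many distinct values the column has relative to the number of rows:

```python
# memory used (in bytes) by maritalstatus as object versus category
nls97abb['maritalstatus'].astype('object').memory_usage(deep=True)
nls97abb['maritalstatus'].memory_usage(deep=True)
```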
- Let's check our category features for missing values. There are no missing values for `gender` and very few for `highestdegree`. But the overwhelming majority of values for `govprovidejobs` (the government should provide jobs) and `govpricecontrols` (the government should control prices) are missing. This means that those features probably won't be useful for most modeling:

```python
nls97abb[catcols].isnull().sum()
```

```
gender                  0
maritalstatus        2312
weeklyhrscomputer    2274
weeklyhrstv          2273
highestdegree          31
govprovidejobs       7151
govpricecontrols     7125
dtype: int64
```
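To put those counts in perspective as shares of the sample, we can take the mean of the Boolean DataFrame returned by `isnull`, since the mean of `True`/`False` values is the proportion of `True` values:

```python
# share of missing values per categorical column
nls97abb[catcols].isnull().mean()
```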
- We can generate frequencies for multiple features at once by passing a `value_counts` call to `apply`. We can use `filter` to select the columns that we want – in this case, all the columns with gov in their name. Note that the missing values for each feature have been omitted since we did not set `dropna` to `False`:

```python
nls97abb.filter(like="gov").apply(pd.value_counts, normalize=True)
```

```
                   govprovidejobs  govpricecontrols
1. Definitely                0.25              0.54
2. Probably                  0.34              0.33
3. Probably not              0.25              0.09
4. Definitely not            0.16              0.04
```
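If you are on a newer pandas release where the top-level `pd.value_counts` function is deprecated, an equivalent approach is to pass a small lambda that calls the Series method instead:

```python
# same result without the top-level pd.value_counts function
nls97abb.filter(like="gov").\
  apply(lambda col: col.value_counts(normalize=True))
```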
- We can generate the same frequencies on a subset of our data. If, for example, we want to see the responses of only married people to the government role questions, we can do that subsetting by placing `nls97abb.loc[nls97abb.maritalstatus=="Married"]` before `filter`:

```python
nls97abb.loc[nls97abb.maritalstatus=="Married"].\
  filter(like="gov").\
  apply(pd.value_counts, normalize=True)
```

```
                   govprovidejobs  govpricecontrols
1. Definitely                0.17              0.46
2. Probably                  0.33              0.38
3. Probably not              0.31              0.11
4. Definitely not            0.18              0.05
```
- Since, in this case, there were only two gov columns, it may have been easier to do the following:

```python
nls97abb.loc[nls97abb.maritalstatus=="Married",
             ['govprovidejobs','govpricecontrols']].\
  apply(pd.value_counts, normalize=True)
```

```
                   govprovidejobs  govpricecontrols
1. Definitely                0.17              0.46
2. Probably                  0.33              0.38
3. Probably not              0.31              0.11
4. Definitely not            0.18              0.05
```
Nonetheless, it will often be easier to use `filter`, since it is not unusual to have to do the same cleaning or exploration task on groups of features with similar names.
There are times when we may want to model a continuous or discrete feature as categorical. The NLS DataFrame contains `highestgradecompleted`. A year increase from 5 to 6 may not be as important as one from 11 to 12 in terms of its impact on a target. Let's create a dichotomous feature instead – that is, 1 when the person has completed 12 or more grades, 0 if they have completed fewer than that, and missing when `highestgradecompleted` is missing.
- We need to do a little bit of cleaning up first, though. `highestgradecompleted` has two logical missing values – an actual NaN value that pandas recognizes as missing, and a value of 95 that the survey designers intend for us to also treat as missing for most use cases. Let's use `replace` to fix that before moving on:

```python
nls97abb.highestgradecompleted.\
  replace(95, np.nan, inplace=True)
```
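Note that calling `replace` with `inplace=True` on a column selected from the DataFrame may warn or, under pandas' copy-on-write behavior, fail to update the DataFrame at all. If you run into that, a safer pattern is to assign the result back to the column:

```python
# assign the cleaned values back rather than relying on inplace=True
# on a selected column; this behaves the same under copy-on-write
nls97abb['highestgradecompleted'] = \
  nls97abb['highestgradecompleted'].replace(95, np.nan)
```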
- We can use NumPy's `where` function to assign values to `highschoolgrad` based on the values of `highestgradecompleted`. If `highestgradecompleted` is null (`NaN`), we assign `NaN` to our new column, `highschoolgrad`. If the value for `highestgradecompleted` is not null, the next clause tests for a value less than 12, setting `highschoolgrad` to 0 if that is true, and to 1 otherwise. We can confirm that the new column, `highschoolgrad`, contains the values we want by using `groupby` to get the min and max values of `highestgradecompleted` at each level of `highschoolgrad`:

```python
nls97abb['highschoolgrad'] = \
  np.where(nls97abb.highestgradecompleted.isnull(), np.nan,
    np.where(nls97abb.highestgradecompleted<12, 0, 1))

nls97abb.groupby(['highschoolgrad'], dropna=False)\
  ['highestgradecompleted'].agg(['min','max','size'])
```

```
                min  max  size
highschoolgrad
0                 5   11  1231
1                12   20  5421
nan             nan  nan  2332
```

```python
nls97abb['highschoolgrad'] = \
  nls97abb['highschoolgrad'].astype('category')
```
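As a side note, a similar dichotomy can be built with `pd.cut`, which leaves missing values missing and returns a category column in one step. This is only a sketch: the column name `highschoolgrad2` is hypothetical, and the bin edges assume completed grades run from 5 to 20, as they do in this data:

```python
# (0, 11] -> 0 and (11, 20] -> 1; NaN stays NaN
nls97abb['highschoolgrad2'] = pd.cut(
  nls97abb['highestgradecompleted'],
  bins=[0, 11, 20], labels=[0, 1])
```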
While 12 makes conceptual sense as the threshold for classifying our new feature, `highschoolgrad`, this would present some modeling challenges if we intended to use `highschoolgrad` as a target. There is a pretty substantial class imbalance, with the `highschoolgrad` equal to 1 group being more than four times the size of the 0 group. We should explore using more groups to represent `highestgradecompleted`.
- One way to do this with pandas is with the `qcut` function. We can set the `q` parameter of `qcut` to `6` to create six groups that are as evenly distributed as possible. These groups are now closer to being balanced:

```python
nls97abb['highgradegroup'] = \
  pd.qcut(nls97abb['highestgradecompleted'],
    q=6, labels=[1,2,3,4,5,6])

nls97abb.groupby(['highgradegroup'])['highestgradecompleted'].\
  agg(['min','max','size'])
```

```
                min  max  size
highgradegroup
1                 5   11  1231
2                12   12  1389
3                13   14  1288
4                15   16  1413
5                17   17   388
6                18   20   943
```

```python
nls97abb['highgradegroup'] = \
  nls97abb['highgradegroup'].astype('category')
```
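Since `highgradegroup` is now just another category feature, the same one-way frequency techniques apply to it. For example, the following would show the count for each group, including missing values:

```python
# frequencies for the new grouping, in group order
nls97abb['highgradegroup'].value_counts(dropna=False).sort_index()
```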
- Finally, I typically find it helpful to generate frequencies for all the categorical features and save that output so that I can refer to it later. I rerun that code whenever I make some change to the data that may change these frequencies. The following code iterates over all the columns that are of the category data type and runs `value_counts`:

```python
freqout = open('views/frequencies.txt', 'w')

for col in nls97abb.select_dtypes(include=["category"]):
    print(col, "----------------------",
      "frequencies",
      nls97abb[col].value_counts(dropna=False).sort_index(),
      "percentages",
      nls97abb[col].value_counts(normalize=True).\
        sort_index(),
      sep="\n\n", end="\n\n\n", file=freqout)

freqout.close()
```
These are the key techniques for generating one-way frequencies for the categorical features in your data. The real star of the show has been the `value_counts` method. We can use `value_counts` to create frequencies one Series at a time, use it with `apply` for multiple columns, or iterate over several columns and call `value_counts` each time. We have looked at examples of each in this section. Next, let's explore some techniques for examining the distribution of continuous features.