Statistical inference for categorical data
A categorical variable takes its values from a set of distinct categories, or levels, rather than from a numerical scale. Categorical data is common in daily life; examples include gender (male or female, although a modern view may differ), type of property sale (new property or resale), and industry. The ability to make sound inferences about these variables is thus essential for drawing meaningful conclusions and making well-informed decisions in diverse contexts.
Because a categorical variable holds labels rather than numbers, we often cannot pass it to a machine learning (ML) model without additional preprocessing. Take the industry variable, for example. Instead of passing the categorical values (string values such as "finance" or "technology") to the model, a common approach is to one-hot encode the variable into multiple columns, with each column corresponding to a specific industry and holding a binary value of 0 or 1.
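The one-hot encoding described above can be sketched as follows. This is a minimal illustration using pandas, not necessarily the preprocessing used elsewhere in the text; the toy DataFrame, its `industry` column, and the "retail" level are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy data: the column name and its values are illustrative only.
df = pd.DataFrame({"industry": ["finance", "technology", "finance", "retail"]})

# One-hot encode the column: each industry level becomes its own 0/1 column.
# dtype=int requests integer indicators instead of booleans.
encoded = pd.get_dummies(df, columns=["industry"], dtype=int)

print(encoded)
```

Each row of `encoded` contains exactly one 1, in the column matching that row's original industry. For linear models it is common to drop one of the columns (for example with `drop_first=True`) to avoid perfectly collinear features.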
In this section, we will explore various statistical...