Grouping categorical values
In the data used for modeling, we frequently find attributes with a large number of different categorical values. A typical example is a product code, which identifies a product purchased by a customer.
A data attribute with many different values can cause problems for data mining algorithms: complex data can make the algorithms run slowly and can make patterns in the data harder to find, leading to less accurate models. A useful step in data preparation is to simplify this kind of complex data by grouping the values of a categorical variable into a smaller set of values, where the grouping is related to the problem to be solved.
This recipe shows how to group product codes by their relation to a target response variable. It produces product groups, which are groupings of product codes, based on deciles of the response rates for each product code.
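The grouping idea can be sketched in a few lines of Python with pandas. This is only an illustration of the technique, not the steps of the recipe itself. The column names product_code and response, the function name group_by_response_decile, and the assumption that the response is coded 0/1 are all hypothetical.

```python
import pandas as pd


def group_by_response_decile(df, code_col="product_code",
                             target_col="response", n_groups=10):
    """Assign each product code to a group based on its response rate.

    Assumes target_col holds 0/1 values, so its mean per product code
    is that code's response rate. Column names are hypothetical.
    """
    # Response rate for each product code.
    rates = df.groupby(code_col)[target_col].mean()
    # Rank the rates so that tied rates still give distinct bin edges.
    ranks = rates.rank(method="first")
    # Never ask for more groups than there are product codes.
    n = min(n_groups, len(rates))
    # Quantile bins over the ranks; group 1 has the lowest response rates.
    groups = pd.qcut(ranks, q=n, labels=False) + 1
    return groups.rename("product_group")
```

The returned series is indexed by product code, so it can be attached to the original records with something like df["product_group"] = df["product_code"].map(groups). Ranking before binning is a design choice made here so that many product codes with identical response rates do not collapse the decile boundaries.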
Getting ready
This recipe uses the following files:
- Datafile: Transactions_File.txt
- Datafile: Promotions_File...