Estimating the income bracket
We will build a classifier to estimate a person's income bracket from 14 attributes. There are two possible output classes: income higher than 50,000, or lower than or equal to 50,000. The twist in this dataset is that each datapoint is a mixture of numbers and strings. The numerical fields are valuable as they are, so we cannot simply run a label encoder over every field. We need to design a system that can deal with numerical and non-numerical data at the same time.
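One way to picture handling numbers and strings side by side is the minimal sketch below. It is not the recipe's implementation: the helper names `is_number` and `encode_mixed` are hypothetical, the sample rows are made up for illustration and shortened to four fields, and the string-to-integer mapping is a hand-rolled stand-in for a library label encoder. Numeric columns pass through unchanged, while each string column gets its own mapping.

```python
def is_number(value):
    """Return True if the string holds a (possibly negative) integer."""
    return value.lstrip("-").isdigit()


def encode_mixed(rows):
    """Encode string columns as integer codes; keep numeric columns as ints.

    Returns the encoded rows and a per-column list of mappings
    (None for columns that were already numeric).
    """
    n_cols = len(rows[0])
    encoders = [None] * n_cols
    encoded_cols = []
    for j in range(n_cols):
        column = [row[j] for row in rows]
        if all(is_number(v) for v in column):
            # Numeric column: keep the actual values.
            encoded_cols.append([int(v) for v in column])
        else:
            # String column: assign codes in sorted label order.
            classes = sorted(set(column))
            mapping = {label: code for code, label in enumerate(classes)}
            encoders[j] = mapping
            encoded_cols.append([mapping[v] for v in column])
    # Transpose the columns back into rows.
    encoded = [list(r) for r in zip(*encoded_cols)]
    return encoded, encoders


# Illustrative rows only: age, work class, education, hours per week.
rows = [
    ["39", "State-gov", "Bachelors", "40"],
    ["50", "Self-emp-not-inc", "Bachelors", "13"],
    ["38", "Private", "HS-grad", "40"],
]
encoded, encoders = encode_mixed(rows)
print(encoded)
```

Keeping one mapping per column means the same mappings can be reused later to encode a new, unseen datapoint in the same way.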
Getting ready
We will use the census income dataset available at https://archive.ics.uci.edu/ml/datasets/Census+Income.
The dataset has the following characteristics:
- Number of instances: 48,842
- Number of attributes: 14
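A quick sanity check of these characteristics can be sketched as follows. The sketch assumes the data arrives as comma-separated lines with the 14 attributes followed by the income label in the last field. The function name `load_census` is hypothetical, and the two sample records are written in that layout for illustration only.

```python
def load_census(lines, n_attributes=14):
    """Split comma-separated records into attribute lists and labels.

    Assumes each non-blank line has n_attributes fields followed by
    one label field; raises ValueError otherwise.
    """
    X, y = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != n_attributes + 1:
            raise ValueError(
                "expected %d fields, got %d" % (n_attributes + 1, len(fields))
            )
        X.append(fields[:-1])
        y.append(fields[-1])
    return X, y


# Illustrative records in the assumed layout.
sample_lines = [
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K",
    "",
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K",
]
X, y = load_census(sample_lines)
print(len(X), len(X[0]), y)
```

With a local copy of the dataset, the same function could be fed an open file handle instead of `sample_lines`, and `len(X)` compared against the instance count listed above.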
The following...