Our first attempt at classifying the Titanic data uses a naive, yet very intuitive, approach. It involves the following steps:
- Select a set of features, S, that influence whether a person survived or not.
- For each possible combination of feature values, use the training data to determine whether the majority of cases with that combination survived or not. The results can be stored in what is known as a survival table.
- For each test example whose survival we wish to predict, look up the combination that matches its feature values and assign the corresponding entry of the survival table as its predicted value. This can be viewed as a crude nearest-neighbor approach in which the neighbors of a test example are the training cases that share exactly the same feature values.
Based on what we have seen earlier in our analysis, three features seem to have the most influence on the survival rate:
- Passenger class
- Gender
- Passenger fare (bucketed)
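
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not necessarily the implementation used in the rest of the analysis: the field names (`pclass`, `sex`, `fare`, `survived`), the fare bucket edges, the tie-breaking rule, and the handful of toy records are all hypothetical and chosen for readability.

```python
from collections import defaultdict

# Hypothetical fare bucket edges; the bucketing is an assumption for this sketch.
FARE_EDGES = (10, 20, 30)


def fare_bucket(fare):
    """Return the bucket index: 0 for fare < 10, up to 3 for fare >= 30."""
    return sum(fare >= edge for edge in FARE_EDGES)


def feature_key(passenger):
    """Combine the three selected features into one lookup key."""
    return (passenger["pclass"], passenger["sex"], fare_bucket(passenger["fare"]))


def build_survival_table(train):
    """Map each feature combination seen in training to its majority outcome."""
    counts = defaultdict(lambda: [0, 0])  # key -> [number died, number survived]
    for p in train:
        counts[feature_key(p)][p["survived"]] += 1
    # Ties are resolved as "did not survive" (an arbitrary choice for this sketch).
    return {key: int(survived > died) for key, (died, survived) in counts.items()}


def predict(table, passenger, default=0):
    """Look up the passenger's feature combination; fall back to a default if unseen."""
    return table.get(feature_key(passenger), default)


# Toy records for illustration only; these are not rows from the real data set.
train = [
    {"pclass": 1, "sex": "female", "fare": 80.0, "survived": 1},
    {"pclass": 1, "sex": "female", "fare": 60.0, "survived": 1},
    {"pclass": 1, "sex": "female", "fare": 45.0, "survived": 0},
    {"pclass": 3, "sex": "male", "fare": 7.5, "survived": 0},
    {"pclass": 3, "sex": "male", "fare": 8.0, "survived": 0},
    {"pclass": 3, "sex": "male", "fare": 9.0, "survived": 1},
]

table = build_survival_table(train)
print(predict(table, {"pclass": 1, "sex": "female", "fare": 52.0}))
print(predict(table, {"pclass": 3, "sex": "male", "fare": 7.9}))
```

One design choice worth noting is the `default` argument: a test example may have a feature combination that never occurs in the training data, in which case the table has no entry and some fallback prediction is needed.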