Decision trees
Decision trees are simple algorithms that split data based on specific values of its features. Let's use the loan data from the Test your knowledge sections in Chapters 11 and 12, which has a TARGET column that denotes whether someone had trouble paying back a loan (1) or not (0).
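To make "splitting on a specific value" concrete, here is a minimal sketch of a single split. The rows, the income feature, and the threshold of 27 are made up for illustration and are not taken from the loan dataset; a real decision tree searches for good thresholds automatically rather than having one supplied by hand.

```python
# Toy rows of (income, TARGET); the values are hypothetical, for illustration only.
rows = [(20, 1), (25, 1), (30, 0), (45, 0), (60, 0), (22, 1)]


def split_on_threshold(rows, threshold):
    """Split rows into two groups: feature <= threshold and feature > threshold."""
    left = [r for r in rows if r[0] <= threshold]
    right = [r for r in rows if r[0] > threshold]
    return left, right


def positive_rate(group):
    """Fraction of rows in the group with TARGET == 1."""
    return sum(target for _, target in group) / len(group)


# A hand-picked threshold; a tree would choose this by evaluating many candidates.
left, right = split_on_threshold(rows, threshold=27)
print(positive_rate(left), positive_rate(right))
```

In this toy case, the split separates the two classes perfectly, which is the kind of outcome a tree tries to approach with each split it chooses.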
First, we'll load the data:
import pandas as pd
df = pd.read_csv('data/loan_data_sample.csv', index_col='SK_ID_CURR')
If the examples run slowly on your computer, you can sample down the data using df.sample(). Some string columns need to be converted to numeric datatypes, since sklearn can only handle numeric data:
numeric_df = df.copy()
numeric_df['NAME_CONTRACT_TYPE'] = numeric_df['NAME_CONTRACT_TYPE'].map(
    {'Cash loans': 0, 'Revolving loans': 1})
numeric_df['CODE_GENDER'] = numeric_df['CODE_GENDER'].map({'M': 0, 'F': 1})
numeric_df...
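As a side note on the two pandas calls used above, the sketch below shows how df.sample() and .map() behave on a tiny made-up DataFrame. The toy data, the 'Unknown' category, and the CODE_GENDER_NUM column name are hypothetical and are not part of the loan dataset. The point to check is that .map() with a dictionary turns any value missing from the dictionary into NaN, so it is worth counting missing values after converting.

```python
import pandas as pd

# Hypothetical toy data, not the real loan dataset.
toy = pd.DataFrame({
    'CODE_GENDER': ['M', 'F', 'Unknown', 'F'],
    'TARGET': [0, 1, 0, 0],
})

# Values not present in the dictionary ('Unknown' here) become NaN.
toy['CODE_GENDER_NUM'] = toy['CODE_GENDER'].map({'M': 0, 'F': 1})
n_missing = toy['CODE_GENDER_NUM'].isna().sum()
print(n_missing)

# Sample half of the rows; random_state makes the sample reproducible.
sampled = toy.sample(frac=0.5, random_state=42)
print(len(sampled))
```

If the count of missing values is nonzero after mapping, those rows need to be handled (for example, filled or dropped) before fitting a model that cannot accept NaN.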