Preparing the data
We should do a few more data checks. Most importantly, let’s check the balance between classes in the training data:
- Start with the following code:
clf_df['label'].value_counts()
…
0 66
1 11
Name: label, dtype: int64
The data is imbalanced, but not too badly.
- Let’s get this as proportions of the total, just to make this a little easier to understand:
clf_df['label'].value_counts(normalize=True)
…
0 0.857143
1 0.142857
Name: label, dtype: float64
It looks like we have about an 86/14 balance between the classes. Not awful. Keep this in mind, because a model that always predicts the majority class (0) would already be right about 86% of the time, purely because of the imbalance. A model that only hits 86% accuracy won’t be impressive at all.
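To make the 86% figure concrete, here is a minimal sketch of that do-nothing baseline. It assumes only the label counts shown above (66 zeros, 11 ones) and uses a plain Python list as a stand-in for clf_df['label']; the variable names are illustrative and not part of the original code.

```python
from collections import Counter

# Stand-in for clf_df['label'], built from the counts shown above
# (66 rows labeled 0, 11 rows labeled 1).
labels = [0] * 66 + [1] * 11

# A "model" that ignores the features and always predicts the most common class.
majority_class, majority_count = Counter(labels).most_common(1)[0]
predictions = [majority_class] * len(labels)

# Accuracy of this baseline: the share of rows it happens to get right.
correct = sum(p == y for p, y in zip(predictions, labels))
baseline_accuracy = correct / len(labels)

# Recall on the minority class: how many of the 1s the baseline catches.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
minority_recall = true_positives / labels.count(1)

print(f"baseline accuracy: {baseline_accuracy:.3f}")  # 0.857
print(f"minority recall:   {minority_recall:.3f}")    # 0.000
```

The baseline reaches about 86% accuracy while never identifying a single row from class 1, which is why accuracy alone is a weak yardstick on imbalanced data like this.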
- Next, we need to cut up our data for our model. We will use the features as our X data, and the answers...