Slimming the classifier
It is always worth looking at the contributions of the individual features. For logistic regression, we can directly inspect the learned coefficients (clf.coef_) to get an impression of each feature's impact. The larger the absolute value of a feature's coefficient, the bigger the role that feature plays in determining whether the post is good or not. Positive coefficients mean that higher values of the feature argue for a good post, while negative coefficients tell us that higher values of the corresponding feature are a stronger signal for the post to be classified as bad:
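One way to do this inspection is to pair each coefficient with its feature name and sort by absolute magnitude. The following is a minimal sketch using toy data; the feature names match those discussed in the text, but the training matrix here is random stand-in data, not the real post corpus:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature names as discussed in the text; the data below is a toy stand-in
feature_names = ["LinkCount", "NumExclams", "NumImages", "AvgSentLen"]

rng = np.random.RandomState(3)
X = rng.rand(100, len(feature_names))
# Toy labels: "good" posts are driven mostly by the first feature
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)

clf = LogisticRegression()
clf.fit(X, y)

# Sort features by the absolute value of their coefficient (strongest first)
for name, coef in sorted(zip(feature_names, clf.coef_[0]),
                         key=lambda t: abs(t[1]), reverse=True):
    print("%12s: %+.3f" % (name, coef))
```

Sorting by absolute value rather than the raw coefficient is what lets us rank features by impact regardless of whether they push toward the good or the bad class.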
We see that LinkCount and NumExclams have the biggest impact on the overall classification decision, while NumImages and AvgSentLen play a rather minor role. While the feature importances overall make sense intuitively, it is surprising that NumImages is basically ignored. Normally, answers containing images are rated highly. In reality, however, answers very rarely contain images. So although in principle it is a very powerful feature, it is too sparse to be of any value.