Creating our first classifier and tuning it
The Naive Bayes classifiers reside in the sklearn.naive_bayes package. There are different kinds of Naive Bayes classifiers:
GaussianNB: This assumes the features to be normally distributed (Gaussian). One use case for it could be the classification of sex according to the given height and width of a person. In our case, we are given tweet texts from which we extract word counts. These are clearly not Gaussian distributed.
MultinomialNB: This assumes the features to be occurrence counts, which is relevant to us since we will be using word counts in the tweets as features. In practice, this classifier also works well with TF-IDF vectors.
BernoulliNB: This is similar to MultinomialNB, but more suited when using binary word occurrences and not word counts.
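To make the difference between word counts and binary word occurrences concrete, here is a minimal sketch. The four tweets and their labels are invented purely for illustration and are not part of our tweet data. It assumes scikit-learn's CountVectorizer is used to build the features, with its binary=True option producing the occurrence flags.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy tweets, invented for illustration only.
tweets = [
    "good good phone",
    "great phone",
    "bad battery",
    "bad bad screen",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative (assumed labels)

# Word counts: the kind of input MultinomialNB is designed for.
counts = CountVectorizer().fit_transform(tweets)

# Binary occurrences (word present or absent): the input BernoulliNB expects.
binary = CountVectorizer(binary=True).fit_transform(tweets)

# The vocabulary is sorted alphabetically:
# bad, battery, good, great, phone, screen
print(counts.toarray()[0])  # "good" is counted twice in the first tweet
print(binary.toarray()[0])  # "good" is only flagged as present

# Both classifiers share the same fit/predict interface.
multinomial = MultinomialNB().fit(counts, labels)
bernoulli = BernoulliNB().fit(binary, labels)
print(multinomial.predict(counts))
```

The only difference between the two feature matrices is that repeated words collapse to 1 in the binary variant, which is exactly the information BernoulliNB ignores and MultinomialNB uses.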
As we will mainly look at word occurrences, MultinomialNB is best suited for our purpose.
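As a rough sketch of how MultinomialNB can be combined with word counts end to end, the following example chains a CountVectorizer and the classifier using make_pipeline. The training tweets, the labels, and the variable names are hypothetical placeholders rather than our real data, and the default parameters are used without any tuning.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training tweets and labels, invented for illustration only.
train_tweets = [
    "I love this phone",
    "what a great day",
    "I hate this weather",
    "this is a terrible movie",
]
train_labels = ["positive", "positive", "negative", "negative"]

# The pipeline first turns raw text into word counts and then
# feeds those counts to the Naive Bayes classifier.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_tweets, train_labels)

# Classify two unseen (also made-up) tweets.
new_tweets = ["great phone", "terrible weather"]
predictions = classifier.predict(new_tweets)
print(predictions)
```

Wrapping both steps in one pipeline means the same vectorizer vocabulary is applied at training and prediction time, so raw tweet strings can be passed in directly.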
Solving an easy problem first
As we have seen when we looked at our tweet data, the tweets are not just...