We are going to build a sample text classifier based on the NLTK Reuters corpus, which is made up of thousands of newswire documents divided into 90 categories:
from nltk.corpus import reuters
print(reuters.categories())
[u'acq', u'alum', u'barley', u'bop', u'carcass', u'castor-oil', u'cocoa', u'coconut', u'coconut-oil', u'coffee', u'copper', u'copra-cake', u'corn', ...
To simplify the process, we'll take only two categories that have a similar number of documents:
import numpy as np
# dtype=object is needed because the tokenized sentences have different lengths
Xr = np.array(reuters.sents(categories=['rubber']), dtype=object)
Xc = np.array(reuters.sents(categories=['cotton']), dtype=object)
Xw = np.concatenate((Xr, Xc))
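For a binary classifier we also need a label vector aligned with Xw. The snippet below is a minimal sketch of that step: the names Yr, Yc, and Yw, and the convention of labeling 'rubber' documents as 0 and 'cotton' documents as 1, are assumptions for illustration (the placeholder counts stand in for len(Xr) and len(Xc)):

```python
import numpy as np

# Placeholder document counts standing in for len(Xr) and len(Xc)
n_rubber, n_cotton = 3, 2

# Label 'rubber' documents as class 0 and 'cotton' documents as class 1,
# then concatenate in the same order used to build Xw
Yr = np.zeros(n_rubber)
Yc = np.ones(n_cotton)
Yw = np.concatenate((Yr, Yc))

print(Yw)  # [0. 0. 0. 1. 1.]
```

Because Yw is concatenated in the same order as Xw, each label stays aligned with its document, so the pair (Xw, Yw) can be shuffled and split together later.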
As each document is already split into tokens and we want to apply...