We can reuse the existing parameter space and classifier from our previous experiments; all we need to do is refit them on the new data. By default, scikit-learn trains from scratch: subsequent calls to fit() discard any previous information.
There is a class of algorithms, called online learning, that update their model incrementally as new samples arrive rather than restarting training each time.
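As a minimal sketch of this idea, scikit-learn exposes online learning through the partial_fit() method on some estimators. The choice of SGDClassifier and the toy data below are assumptions for illustration; any estimator that provides partial_fit would work the same way:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data: 100 samples, 5 features (hypothetical, for illustration only)
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)

clf = SGDClassifier(random_state=0)
# The first call must declare all classes, because later batches
# may not contain every label.
clf.partial_fit(X[:50], y[:50], classes=np.array([0, 1]))
# Subsequent calls update the existing model rather than restarting training.
clf.partial_fit(X[50:], y[50:])
print(clf.predict(X[:5]))
```

Contrast this with fit(), which would discard the model learned from the first batch when called on the second.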
As before, we can compute our scores by using cross_val_score and print the results. The code is as follows:
from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(pipeline, documents, classes, scoring='f1')
print("Score: {:.3f}".format(np.mean(scores)))
The result is 0.683, which is reasonable for such a messy dataset. Adding more data (for example, by increasing max_docs_author in the dataset loading) can improve these results, as will improving the quality of the...