Putting it all together
We can reuse the parameter space and the classifier from our previous experiments; all we need to do is refit them on the new data. By default, scikit-learn trains from scratch: each call to fit() discards any information learned from previous calls.
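To see this behavior concretely, here is a small sketch (using GaussianNB and toy data, which are illustrative assumptions, not the chapter's pipeline). After the second fit() call, the model reflects only the second dataset:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()

# First fit: samples near 0 are class 0, samples near 1 are class 1
X1 = np.array([[0.0], [1.0]])
clf.fit(X1, [0, 1])

# Refitting discards the first model entirely
X2 = np.array([[10.0], [11.0]])
clf.fit(X2, [1, 0])

# The prediction for 0.0 is now based only on the second dataset:
# 0.0 is closer to the class-1 mean (10.0) than the class-0 mean (11.0)
print(clf.predict([[0.0]]))
```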
Note
There is a class of algorithms called online learning that update the model with new samples rather than restarting training each time.
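In scikit-learn, estimators that support online learning expose a partial_fit() method. A minimal sketch, assuming MultinomialNB and toy count data (not the chapter's dataset):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()

# First batch: classes must be declared up front for partial_fit
X1 = np.array([[2, 1], [1, 3]])
clf.partial_fit(X1, [0, 1], classes=[0, 1])

# A later batch updates the existing model instead of discarding it
X2 = np.array([[3, 0], [0, 4]])
clf.partial_fit(X2, [0, 1])

# Sample counts accumulate across both batches
print(clf.class_count_)
```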
As before, we can compute our scores by using cross_val_score
and print the results. The code is as follows:
scores = cross_val_score(pipeline, documents, classes, scoring='f1')
print("Score: {:.3f}".format(np.mean(scores)))
The result is 0.683, which is reasonable for such a messy dataset. Adding more data (for example, by increasing max_docs_author when loading the dataset) can improve these results, as can improving data quality with extra cleaning.