Putting it all together
We can now run the Naive Bayes classifier using these probabilities. We will do this in a Jupyter Notebook, although the processing itself could be moved into an mrjob package and performed at scale.
First, take a look at the models folder that was specified in the last MapReduce job. If the output was more than one file, we can merge the files by appending them to each other, using the following command from within the models directory:
cat * > model.txt
If you do this, you'll need to update the following code to use model.txt as the model filename.
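If you would rather stay inside the Notebook, the same merge can be sketched in Python. This is a minimal sketch, not part of the original workflow: the function name merge_model_files and its parameters are hypothetical, and it assumes the models folder contains only the plain-text output files from the job. Unlike cat, it also makes sure each part ends with a newline so that records from adjacent files cannot run together.

```python
import os


def merge_model_files(models_dir, output_name="model.txt"):
    """Concatenate the files in models_dir into one file (hypothetical helper).

    Roughly equivalent to running `cat * > model.txt` inside models_dir.
    """
    output_path = os.path.join(models_dir, output_name)
    # Sort for a stable order; skip hidden files and the output file itself
    # so that rerunning the merge does not read model.txt back in.
    part_names = sorted(
        name for name in os.listdir(models_dir)
        if name != output_name
        and not name.startswith(".")
        and os.path.isfile(os.path.join(models_dir, name))
    )
    with open(output_path, "w") as out_file:
        for name in part_names:
            with open(os.path.join(models_dir, name)) as part_file:
                for line in part_file:
                    # Guarantee a trailing newline on the last line of each part
                    out_file.write(line if line.endswith("\n") else line + "\n")
    return output_path
```

Calling merge_model_files("models") would then write models/model.txt, assuming the folder is named models and sits in the Notebook's working directory.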
Back in our Notebook, we first add the standard imports we need for our script:
import os
import re
import numpy as np
from collections import defaultdict
from operator import itemgetter
We again redefine our word search regular expression. If you were doing this in a real application, I recommend centralizing the functionality. It is important that words are extracted in the same way for both training...