Naïve-Bayes model for finding keywords
Building an NB model on this dataset takes under an hour and has the potential to significantly increase the quality and coverage of the labeling functions. The core code for the NB model can be found in the spam-inspired-technique-naive-bayes.ipynb notebook. Note that these explorations are separate from the main labeling code. This section can be skipped if desired, because what is learned here is already applied in the improved labeling functions in the snorkel-labeling.ipynb notebook.
The main flow of the NB-based exploration is to load the reviews, remove stop words, take the 2,000 most frequent words to construct a simple vectorization scheme, and train an NB model. Data loading is the same as in previous sections, so those details are omitted here.
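The flow described above can be sketched without any third-party packages. The following is a minimal illustration, not the notebook's code: the stop-word list is a tiny hand-written stand-in for NLTK's, the four reviews are made up, and the function names (tokenize, build_vocab, train_nb, predict, top_keywords) are hypothetical. It assumes word-presence features with add-one smoothing, and it ranks words by the ratio of their per-class probabilities as one possible way to surface candidate keywords.

```python
import math
import re
from collections import Counter

# Tiny stand-in stop-word list (an assumption for this sketch);
# the section itself relies on NLTK's stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "it", "this", "of", "to", "i"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def build_vocab(docs, top_n=2000):
    """Keep the top_n most frequent words across all documents."""
    counts = Counter(w for doc in docs for w in tokenize(doc))
    return [w for w, _ in counts.most_common(top_n)]


def train_nb(docs, labels, vocab):
    """Fit a word-presence (Bernoulli-style) NB model with add-one smoothing."""
    vocab_set = set(vocab)
    classes = sorted(set(labels))
    n_docs = Counter(labels)
    present = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        # Count each vocabulary word at most once per document.
        present[label].update(set(tokenize(doc)) & vocab_set)
    prior = {c: n_docs[c] / len(docs) for c in classes}
    # P(word present | class), smoothed so it is never exactly 0 or 1.
    likelihood = {
        c: {w: (present[c][w] + 1) / (n_docs[c] + 2) for w in vocab}
        for c in classes
    }
    return prior, likelihood


def predict(text, prior, likelihood):
    """Return the class with the highest log-posterior score."""
    words = set(tokenize(text))
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w, p in likelihood[c].items():
            score += math.log(p if w in words else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


def top_keywords(likelihood, target, other, k=5):
    """Rank words by how much more likely they are in target than in other."""
    ratio = {w: likelihood[target][w] / likelihood[other][w] for w in likelihood[target]}
    return sorted(ratio, key=ratio.get, reverse=True)[:k]


# Made-up toy reviews, for illustration only.
reviews = [
    "This movie was wonderful and moving",
    "A wonderful cast and a brilliant script",
    "It was a terrible, boring movie",
    "Terrible acting and a boring plot",
]
labels = ["pos", "pos", "neg", "neg"]

vocab = build_vocab(reviews, top_n=2000)
prior, likelihood = train_nb(reviews, labels, vocab)
print(predict("a wonderful script", prior, likelihood))
print(top_keywords(likelihood, "pos", "neg", k=1))
```

Words that rank highly for one class are candidates for keyword-based labeling functions; on a real dataset the ranking should be reviewed by hand before any word is adopted.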
This section uses the NLTK and wordcloud Python packages. NLTK should already be installed as we have used it in Chapter 1, Essentials of...