Representing documents with TF-IDF and classifying with Naïve Bayes
In addition to evaluation, two other important topics in machine learning are representation and processing algorithms. Representation involves converting a text, such as a document, into a numerical format that preserves the relevant information about the text. The processing algorithm then analyzes this numerical representation to perform the NLP task, such as classification. You’ve already seen a common approach to representation, TF-IDF, in Chapter 7. In this section, we will use TF-IDF with a common classification approach, Naïve Bayes. We will explain both techniques and show an example.
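To make the split between representation and processing concrete, here is a minimal sketch using only the Python standard library. It is not the chapter's own example. The function names (`fit_idf`, `tfidf`, `train_nb`, `predict`), the four toy documents, and their labels are all hypothetical and were invented for illustration. The sketch also assumes a smoothed IDF formula and add-one smoothing in the classifier, which are common choices but only one of several variants.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for this toy data.
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for each vocabulary word."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(tokenize(doc)))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

def tfidf(doc, idf):
    """Representation step: map a document to {word: count * idf}."""
    counts = Counter(tokenize(doc))
    # Words never seen in training have no IDF value and are skipped.
    return {w: c * idf[w] for w, c in counts.items() if w in idf}

def train_nb(docs, labels, idf, alpha=1.0):
    """Processing step: multinomial Naive Bayes over TF-IDF weights."""
    class_counts = Counter(labels)
    weights = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for w, v in tfidf(doc, idf).items():
            weights[label][w] += v
    model = {}
    for label, n in class_counts.items():
        # Add-alpha smoothing keeps unseen word/class pairs from zeroing the score.
        total = sum(weights[label].values()) + alpha * len(idf)
        log_prior = math.log(n / len(docs))
        log_like = {w: math.log((weights[label][w] + alpha) / total) for w in idf}
        model[label] = (log_prior, log_like)
    return model

def predict(doc, model, idf):
    """Return the label with the highest log prior plus weighted log likelihood."""
    vec = tfidf(doc, idf)
    scores = {
        label: log_prior + sum(v * log_like[w] for w, v in vec.items())
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical toy corpus, invented for this sketch.
train_docs = [
    "the film was wonderful and moving",
    "a wonderful cast and a great story",
    "the plot was dull and boring",
    "a boring film with a dull script",
]
train_labels = ["pos", "pos", "neg", "neg"]

idf = fit_idf(train_docs)
model = train_nb(train_docs, train_labels, idf)
print(predict("a great and wonderful story", model, idf))
```

The two stages stay separate on purpose: `tfidf` only produces the numerical representation, and `train_nb` and `predict` only consume it, so either stage could be swapped for a different technique without changing the other.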
Summary of TF-IDF
As discussed in Chapter 7, TF-IDF is based on the intuitive goal of finding the words in a document that are particularly diagnostic of the document's topic or class. Words that are relatively infrequent in the whole corpus, but which are relatively common in...