Analyzing documents using tf-idf
In this section, we will learn how to analyze documents quantitatively. A simple way is to look at the distribution of unigrams (single words) across the document and how often each one occurs, a quantity called term frequency (tf). Words with a higher term frequency generally tend to dominate the document.
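As a minimal sketch of term frequency, the following Python snippet counts unigrams in a single document and normalizes by the document length. The function name term_frequency and the sample sentence are hypothetical, and splitting on whitespace is a simplifying assumption; a real pipeline would normally use a proper tokenizer.

```python
from collections import Counter

def term_frequency(text):
    """Return the relative frequency of each lowercase unigram in text."""
    # Assumption: whitespace splitting is a good enough tokenizer here.
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

doc = "the cat sat on the mat the end"  # made-up example document
tf = term_frequency(doc)

# Print the words from most to least frequent.
for word, value in sorted(tf.items(), key=lambda item: -item[1]):
    print(f"{word}: {value:.3f}")
```

In this toy document the most frequent word is the, which illustrates why raw term frequency alone can be misleading.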
However, this reasoning breaks down for words that are common everywhere, such as the, is, and of. These are typically removed using stop word dictionaries. Apart from stop words, there may be other words that occur frequently but carry little relevance. Such words are penalized using their inverse document frequency (idf) values: the more documents in the corpus a word appears in, the lower its idf and the more heavily it is penalized.
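The two ideas above, stop word removal and the idf penalty, can be sketched as follows. This is an illustration under stated assumptions: the STOP_WORDS set is a tiny hand-picked list rather than a real stop word dictionary, the corpus is made up, and idf is taken as log(N / df), which is one common variant among several.

```python
import math

# Assumption: a tiny illustrative stop word list, not a real dictionary.
STOP_WORDS = {"the", "is", "of", "a"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def inverse_document_frequency(docs):
    """Compute idf(w) = log(N / df(w)), where df(w) counts documents containing w."""
    n_docs = len(docs)
    doc_sets = [set(tokenize(d)) for d in docs]
    vocab = set().union(*doc_sets)
    return {w: math.log(n_docs / sum(w in s for s in doc_sets)) for w in vocab}

corpus = [  # made-up example corpus
    "the price of the stock is rising",
    "the stock market is volatile",
    "the weather is a stock topic of conversation",
]
idf = inverse_document_frequency(corpus)

# Words found in every document come first, with the lowest idf.
for word, value in sorted(idf.items(), key=lambda item: item[1]):
    print(f"{word}: {value:.3f}")
```

Here stock survives stop word removal but appears in every document, so under this variant its idf is zero, while a word confined to one document, such as price, receives the highest value.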
Note
The tf-idf statistic combines these two quantities (by multiplication) and provides a measure of the importance or relevance of each word to a given document within a collection of documents (a corpus).
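To make the multiplication concrete, here is a small self-contained Python sketch that computes a tf-idf score for every word in every document of a toy corpus. The function name tf_idf and the corpus are hypothetical, tf is assumed to be the relative frequency within a document, and idf is assumed to be log(N / df); libraries often apply different smoothing or normalization, so their numbers may not match these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf * idf} dictionary per document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for tokens in tokenized for w in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # tf = relative frequency in this document; idf = log(N / df)
            w: (c / len(tokens)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return scores

corpus = ["apple banana apple", "banana cherry", "banana apple date"]  # made-up
scores = tf_idf(corpus)

for i, doc_scores in enumerate(scores):
    print(i, {w: round(s, 3) for w, s in doc_scores.items()})
```

Because banana occurs in all three documents, its score is zero everywhere, whereas cherry, which is unique to the second document, dominates that document.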
In this section, we will generate a tf-idf matrix...