Implementing TF-IDF
Since we can choose the embedding for each word, we might decide to change the weighting on certain words. One such strategy is to upweight useful words and downweight overly common or too rare words. The embedding we will explore in this recipe is an attempt to achieve this.
Getting ready
TF-IDF is an acronym that stands for Term Frequency – Inverse Document Frequency. This value is essentially the product of the term frequency and the inverse document frequency for each word.
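To make the product concrete, here is a minimal sketch of the two factors computed by hand. The toy corpus, the tokenization by whitespace, and the unsmoothed logarithm are illustrative assumptions, not the recipe's exact implementation:

```python
import math

# Hypothetical toy corpus for illustration only
docs = [
    "the big sale on viagra starts now",
    "the meeting is at noon",
    "the sale ends now and the deal is final",
]

def tf(word, doc_tokens):
    # Term frequency: occurrences of the word divided by document length
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, tokenized_docs):
    # Inverse document frequency: log of total documents over
    # the number of documents that contain the word
    containing = sum(1 for d in tokenized_docs if word in d)
    return math.log(len(tokenized_docs) / containing)

tokenized = [d.split() for d in docs]
for word in ("the", "sale", "viagra"):
    scores = [tf(word, d) * idf(word, tokenized) for d in tokenized]
    print(word, [round(s, 3) for s in scores])
```

Note that the, which appears in every document, gets an IDF of log(1) = 0, so its TF-IDF score is zero everywhere, while the rarer viagra receives the largest weight.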
In the prior recipe, we introduced the bag of words methodology, which assigns a value of one for every occurrence of a word in a sentence. This is probably not ideal, as each category of sentence (spam and ham in the prior recipe's example) most likely contains words such as the and and at roughly the same frequency, whereas words such as viagra and sale should probably carry increased importance when figuring out whether or not a text is spam.
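This reweighting effect can be seen with scikit-learn's TfidfVectorizer, shown here as a hedged sketch on invented spam-like and ham-like texts (not the recipe's dataset): a word shared by many documents ends up with a lower IDF than a word confined to one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical spam-like and ham-like texts for illustration
texts = [
    "free viagra sale today",
    "lunch today at noon",
    "sale ends today",
    "meeting notes attached",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)  # sparse (4 docs x vocab) matrix

vocab = vectorizer.vocabulary_  # word -> column index
idf = vectorizer.idf_           # learned IDF weight per column

# "today" occurs in three documents, "viagra" in only one,
# so "viagra" gets the larger IDF weight
print("idf(today) =", round(idf[vocab["today"]], 3))
print("idf(viagra) =", round(idf[vocab["viagra"]], 3))
```

Words that distinguish one category of text therefore dominate the resulting vectors, while near-ubiquitous words are pushed toward zero.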
We first want to take the word frequency into consideration. Here we consider...