What TF-IDF is
One-hot encoding records only the presence of a word and says nothing about its relative importance. BoW improves on one-hot encoding by measuring word frequency. However, word frequency does not imply word importance. For example, in Figure 2.2, the word “the” appears twice in the first sentence, and “be” appears three times in the second sentence, but neither adds any specific color to the poem. In linguistics, it is often the case that words that appear less frequently carry more distinctive meanings. The terms “shining,” “steal,” “night,” and “sky” are the ones that paint a vivid picture in our poem. Can we improve upon one-hot encoding or BoW?
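To make the limitation of raw counts concrete, here is a minimal sketch in plain Python. The two sentences are hypothetical toy examples, not the poem from Figure 2.2, and the helper name bag_of_words is introduced only for illustration. It counts words the way BoW does and reports the most frequent word in each sentence.

```python
from collections import Counter

# Hypothetical toy sentences (assumed for illustration; not the poem in Figure 2.2).
sentences = [
    "the moon is shining in the night sky",
    "the thief will steal the silver in the night",
]


def bag_of_words(sentence):
    """Count how often each whitespace-separated word occurs in a sentence."""
    return Counter(sentence.lower().split())


for sentence in sentences:
    counts = bag_of_words(sentence)
    # most_common(1) returns the single highest-count (word, count) pair.
    top_word, top_count = counts.most_common(1)[0]
    print(f"most frequent word: {top_word!r} ({top_count} occurrences)")
```

In both toy sentences the highest count belongs to “the,” while the words that actually distinguish the sentences, such as “shining” or “steal,” each occur only once. A raw count therefore ranks the least informative word highest.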
Term frequency–inverse document frequency (TF-IDF) is designed to reflect the importance of a word to a document within a corpus. Many frequently used words such as “the,” “he,” “she,” “we,...