Implementing term frequency-inverse document frequency
Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that captures how relevant a word is to a document, given the entire collection of documents. What does this mean? Some words, such as the English words the, a, and is, appear frequently both within a single document and across documents. These words generally convey little information about the actual content of the document and don’t make the text stand out from the crowd. TF-IDF weighs the importance of a word by comparing how many times it appears in a document with how often it appears across documents. Hence, commonly occurring words such as the, a, or is will have a low weight, and words that are more specific to a topic, such as leopard, will have a higher weight.
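To make the weighting idea concrete, here is a minimal sketch in plain Python. It assumes one common textbook formulation, which may differ from the exact variant used elsewhere in this text: the weight of a word is its relative count within the document multiplied by the logarithm of the number of documents divided by the number of documents containing the word. The toy corpus and the `tf_idf` helper are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the leopard is a big cat",
    "the dog is a pet",
    "the sky is blue",
]

# Naive whitespace tokenization; real text would need proper preprocessing.
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each word.
df = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(word, tokens):
    """Weight of `word` in one document, under the assumed formulation."""
    tf = tokens.count(word) / len(tokens)   # relative frequency in the document
    idf = math.log(n_docs / df[word])       # 0 when the word is in every document
    return tf * idf


# Weights for every word in the first document.
weights = {word: tf_idf(word, tokenized[0]) for word in tokenized[0]}

for word, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{word:>8}: {weight:.3f}")
```

In this toy corpus, the and is appear in every document, so their weight is zero, while leopard appears in only one document and receives the highest weight alongside the other words unique to that document. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothed and normalized variants by default, so their values will differ from this sketch.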
TF-IDF is the product of two statistics: Term Frequency (tf) and Inverse Document Frequency (idf), represented as follows: tf-idf...