Implementing term frequency-inverse document frequency
Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that captures how relevant a word is to a document relative to the entire collection of documents. What does this mean? Some words appear frequently both within a text document and across documents, for example, the English words the, a, and is. These words generally convey little information about the actual content of the document and don’t make it stand out from the crowd. TF-IDF provides a way to weigh the importance of a word by considering how many times it appears in a document relative to how often it appears across documents. Hence, commonly occurring words such as the, a, and is will have a low weight, and words more specific to a topic, such as leopard, will have a higher weight.
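To make the weighting concrete, here is a minimal sketch in plain Python. It assumes the simplest variant: the weight of a word in a document is its raw count multiplied by the natural logarithm of the number of documents divided by the number of documents containing the word. The function name tf_idf, the whitespace tokenization, and the three-sentence corpus are illustrative assumptions, not part of the recipe; libraries typically apply smoothing and normalization on top of this.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumption: raw counts for term frequency and log(N / df) for
    inverse document frequency, with no smoothing or normalization.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # term frequency within this document
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, for illustration only.
corpus = [
    "the leopard is a big cat",
    "the dog is a loyal pet",
    "the car is a fast machine",
]
scores = tf_idf(corpus)

# Words present in every document get log(3 / 3) = 0, so their weight is 0;
# "leopard" occurs in only one document and gets a positive weight.
for term in ("the", "is", "leopard"):
    print(term, scores[0][term])
```

In this toy corpus, the, is, and a occur in all three documents, so their weight collapses to zero, whereas leopard occurs in a single document and receives a weight of log(3).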
TF-IDF is the product of two statistics: term frequency and inverse document frequency. Term frequency is, in its simplest form, the count of the word in...