TF-IDF stands for term frequency-inverse document frequency, a measure of how important a word is to a document within a collection of documents. It is used extensively in information retrieval and reflects the weight of the word in the document. The TF-IDF value increases in proportion to the number of times the word occurs in the document, which is known as the frequency of the word/term. It consists of two key elements: the term frequency and the inverse document frequency.
TF is the term frequency, which is the frequency of a word/term in the document.
For a term t, tf measures the number of times t occurs in document d. Spark implements tf using hashing: each term is mapped to an index by applying a hash function, and the occurrences are counted per index.
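The hashing approach can be illustrated with a minimal plain-Python sketch. This is not Spark's implementation: the function name `hashing_tf`, the `num_features` parameter, and the use of CRC32 as the hash function are assumptions chosen for illustration, and Spark's own hash function and defaults may differ.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features=1024):
    """Count term occurrences per hashed index (illustrative sketch only)."""
    counts = Counter()
    for term in terms:
        # CRC32 is an assumed stand-in hash; it is deterministic across runs,
        # unlike Python's built-in hash() for strings.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


# A tiny document, already split into terms.
doc = "spark makes big data simple and spark is fast".split()
tf = hashing_tf(doc, num_features=1024)

# Look up the count stored at the index that "spark" hashes to.
spark_index = zlib.crc32(b"spark") % 1024
print(spark_index, tf[spark_index])
```

Because the index space is fixed, two different terms can land on the same index (a collision), in which case their counts are added together. A larger `num_features` makes collisions less likely at the cost of a wider vector.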
IDF is the inverse document frequency, which represents how much information a term provides, based on how often the term tends to appear across the documents. IDF is a...