The next few sets of features are based on TF-IDF and SVD. Term Frequency-Inverse Document Frequency (TF-IDF) is one of the algorithms at the foundation of information retrieval. Here, the algorithm is explained using a formula.
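Written out in the notation of this section, the two components of the score can be sketched as follows. This relies on assumptions: the logarithm in IDF follows the common textbook convention, its base is a free choice, and the subscripted symbols N_D and N_{D_t} stand for ND and NDt.

```latex
\[
\mathrm{TF}(t) = \frac{C(t)}{N},
\qquad
\mathrm{IDF}(t) = \log\frac{N_D}{N_{D_t}}
\]
```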
You can understand the formula using this notation: C(t) is the number of times a term t appears in a document and N is the total number of terms in the document; their ratio gives the Term Frequency (TF). ND is the total number of documents and NDt is the number of documents containing the term t; together they provide the Inverse Document Frequency (IDF). The TF-IDF score for a term t is the product of the Term Frequency and the Inverse Document Frequency for that term.
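The computation described above can be illustrated with a minimal pure-Python sketch. It assumes an unsmoothed natural-log IDF and whitespace tokenization; the function name tf_idf and the three sample questions are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document.

    Assumes the plain definition TF(t) * IDF(t) with an unsmoothed
    natural-log IDF; this is an illustrative choice, not a fixed standard.
    """
    nd = len(documents)  # ND: total number of documents
    # NDt: number of documents containing each term (counted once per document)
    doc_freq = Counter(term for doc in documents for term in set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)  # C(t): occurrences of each term in this document
        n = len(doc)           # N: total number of terms in this document
        scores.append({
            term: (c / n) * math.log(nd / doc_freq[term])
            for term, c in counts.items()
        })
    return scores


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "how do i learn python".split(),
    "how do i learn java".split(),
    "what is the capital of france".split(),
]
scores = tf_idf(docs)

# Terms of the first document, highest score first.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.4f}")
```

In this toy corpus, "python" occurs in only one document and therefore scores higher than "how", which occurs in two. Library implementations such as scikit-learn's TfidfVectorizer apply IDF smoothing and vector normalization by default, so their values will differ from this sketch.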
Without any prior knowledge, other than about the documents themselves, such a score will highlight all the terms that could easily discriminate a document from...