In the previous chapter, we detected spam emails by applying a naive Bayes classifier to the extracted feature space. The feature space was represented by term frequency (tf), where a collection of text documents is converted to a matrix of term counts. This representation reflects how terms are distributed within each individual document, but it ignores how terms are distributed across all documents in the entire corpus. For example, some words occur frequently in the language in general, while others occur rarely yet convey important information.
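As a reminder of what such a term-count feature space looks like, here is a minimal sketch using scikit-learn's CountVectorizer; the three short documents are made up purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer

# A small, made-up corpus standing in for the email data
docs = [
    'click to win a free prize now',
    'the meeting agenda for the project review',
    'a free prize is waiting, click now',
]

# Term frequency (tf): each document becomes a row of raw term counts
count_vectorizer = CountVectorizer()
tf_matrix = count_vectorizer.fit_transform(docs)

print(count_vectorizer.get_feature_names_out())
print(tf_matrix.toarray())

Each row counts only the terms within its own document; nothing in the matrix tells us whether a term is common or rare across the corpus as a whole.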
Because of this, we are encouraged to adopt a more comprehensive approach to extracting text features: term frequency-inverse document frequency (tf-idf). It assigns each term frequency a weighting factor that is inversely proportional to the document frequency, that is, the fraction of documents containing the term. In practice, the idf factor of a term t in a set of documents D is calculated as follows...
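Before turning to the exact definition, a quick preview of the effect: the same toy corpus can be re-encoded with scikit-learn's TfidfVectorizer. Note that its default settings apply a smoothed idf and L2 normalization, so the weights follow the same idea as the plain ratio but are not identical to it:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'click to win a free prize now',
    'the meeting agenda for the project review',
    'a free prize is waiting, click now',
]

# tf-idf: raw counts are down-weighted for terms that appear in many
# documents and relatively up-weighted for terms confined to a few documents
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)

print(tfidf_vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))

Within a given document, a term that also appears in other documents receives a lower weight than a term unique to that document, all else being equal.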