The vector space model (VSM), also called the term vector model, is an algebraic model that represents text documents as vectors of identifiers such as index terms. It is used in information filtering, information retrieval, indexing, and relevance ranking.
In the VSM, the weight of each term is calculated from the following two quantities:
- Term frequency (TF): How many times a particular term appears in the document
- Inverse document frequency (IDF): How rare a term is across the documents in a collection; terms that appear in few documents score higher than terms that appear in most of them
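The two quantities above are usually multiplied to give a TF-IDF weight per term. The following is a minimal sketch, not a reference implementation: it assumes raw counts for TF and the common `log(N / df)` form for IDF (libraries often use smoothed or normalized variants), and the function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes TF = raw count and IDF = log(N / df); other variants exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Toy collection, tokenized by whitespace (an assumption for this sketch).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(docs)
```

In this toy collection, "the" occurs in every document, so its IDF is log(3/3) = 0 and its weight vanishes however often it appears, while "mat", which occurs in a single document, receives the highest IDF.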
VSM is implemented in many open source software packages, including Apache Lucene, Elasticsearch, Gensim, NumPy, Weka, word2vec, and Konstanz Information Miner (KNIME).