Essentially, the models we are building compute the pairwise similarity between bodies of text. But how do we numerically quantify the similarity between two bodies of text?
To put it another way, consider three movies: A, B, and C. How can we mathematically prove that the plot of A is more similar to the plot of B than to that of C (or vice versa)?
The first step toward answering these questions is to represent the bodies of text (henceforth referred to as documents) as mathematical quantities. This is done by representing these documents as vectors. In other words, every document is depicted as a series of n numbers, where each number represents a dimension and n is the size of the vocabulary of all the documents put together.
But what are the values of these vectors? The answer to that question depends on the vectorizer we are using to convert our documents...