Leveraging term importance and semantics
Everything we have done up to now has been relatively simple and based on word stems, or so-called tokens. The bag-of-words model was nothing more than a dictionary of tokens with a count of how often each token occurs per field. In this section, we will look at a common technique to further improve matching between documents: combining terms into n-grams and skip-grams.
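To make this concrete, here is a minimal, self-contained Python sketch; the function names ngrams and skipgrams are illustrative, not taken from any particular library. It generates contiguous n-grams and k-skip-n-grams from a list of tokens:

```python
from itertools import combinations

def ngrams(tokens, n):
    """Contiguous n-token sequences."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, k):
    """n-token sequences that allow up to k skipped positions in total."""
    grams = []
    for idx in combinations(range(len(tokens)), n):
        skips = (idx[-1] - idx[0]) - (n - 1)  # positions skipped inside the span
        if skips <= k:
            grams.append(tuple(tokens[i] for i in idx))
    return grams

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(skipgrams(tokens, 2, 1))
# the bigrams above plus pairs that skip one token, e.g. ('the', 'brown')
```

Every such combination becomes a new entry in the dictionary, which is exactly why the vocabulary grows so quickly: with V distinct terms, the number of possible bigrams alone is on the order of V squared.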
Combining terms in multiple ways will explode your dictionary. This becomes a problem if you have a large corpus, for example, one of 10 million words. Hence, we will look at a common preprocessing technique that reduces the dimensionality of a large dictionary: Singular Value Decomposition (SVD).
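As a sketch of how such a reduction can be applied, the snippet below uses scikit-learn; the library choice, the toy corpus, and the number of components are assumptions for illustration only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown dog outpaces a slow fox",
    "singular value decomposition compresses the term dictionary",
]

# Bag-of-words matrix with unigrams and bigrams: one column per dictionary
# entry, which is exactly the dimension that explodes on a large corpus.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(X.shape)          # (3, n_terms) -- n_terms is the dictionary size

# Truncated SVD keeps only the strongest components, projecting each
# document from n_terms dimensions down to n_components.
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (3, 2)
```

Documents are then compared in the reduced space rather than over the full dictionary, which keeps the comparison tractable even when the original vocabulary runs into millions of entries.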
While this approach is now a lot more complicated, it is still based on a bag-of-words model, which in practice already works well on a large corpus. But, of course, we can do better and try to capture the importance of individual words. Therefore, we will tackle another popular technique...