Leveraging term importance and semantics
Everything we have done up to now has been relatively simple and based on word stems, or so-called tokens. The bag-of-words model was nothing but a dictionary of tokens that counts how often each token occurs per field. In this section, we will take a look at a common technique to further improve matching between documents using n-gram and skip-gram combinations of terms.
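The following is a minimal sketch of these two kinds of term combinations, assuming scikit-learn and NLTK are available; the sample sentences and parameter values are purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.util import skipgrams

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps all day",
]

# Bag-of-words extended with bigrams: every unigram and every adjacent
# word pair becomes its own entry in the dictionary.
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # unigrams and bigrams
print(counts.toarray())                    # occurrence counts per document

# Skip-grams relax the adjacency constraint: here we pair words that may
# skip up to two intermediate tokens.
tokens = docs[0].split()
print(list(skipgrams(tokens, n=2, k=2)))
```

Even on these two short sentences, the bigram dictionary is noticeably larger than the plain token dictionary, which foreshadows the problem discussed next.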
Combining terms in multiple ways will explode the size of your dictionary. This becomes a problem if you have a large corpus, for instance, one containing 10 million words. Hence, we will look at a common preprocessing technique to reduce the dimensionality of a large dictionary through singular value decomposition (SVD).
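As a rough sketch of how this reduction can look in practice (again assuming scikit-learn; the number of components and the toy corpus are arbitrary choices for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps all day",
    "a fast auburn fox leaps over a sleepy hound",
]

# The n-gram dictionary grows quickly with corpus size and n-gram length.
counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)
print(counts.shape)  # (number of documents, dictionary size)

# Truncated SVD projects each document onto a handful of latent
# components instead of the full dictionary.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(counts)
print(reduced.shape)                  # (number of documents, 2)
print(svd.explained_variance_ratio_)  # variance retained per component
```

The documents are now represented by a few dense dimensions rather than one count per dictionary entry, which keeps downstream matching tractable even for very large dictionaries.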
While this approach is now a lot more complicated, it is still based on a bag-of-words model, which already works well on a large corpus in practice. Of course, we can do better by trying to understand the importance of individual words. Therefore, we will tackle another popular technique in NLP to compute the...