We discussed and built models based on the Word2Vec approach in Chapter 5, Word Embeddings and Distance Measurements for Text, wherein each word in the vocabulary had a vector representation. Word2Vec relies heavily on the vocabulary it was trained on: words encountered at inference time that are not present in that vocabulary can, at best, be mapped to a generic unknown token representation, and there can be a lot of such unseen words:
Can we do better than this?
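Word2Vec itself offers no mechanism for producing a vector for a word it has never seen. The short sketch below illustrates the problem using the gensim library (the 4.x API is assumed; the tiny corpus and the word foxes are purely illustrative): requesting the vector of an out-of-vocabulary word simply fails unless we map such words to a placeholder token ourselves.

```python
# Illustrative sketch of the out-of-vocabulary problem with Word2Vec
# (assumes gensim 4.x; older versions use size= instead of vector_size=)
from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"],
             ["jumps", "over", "the", "lazy", "dog"]]

model = Word2Vec(sentences, vector_size=10, min_count=1)

print("fox" in model.wv.key_to_index)    # True: seen during training
print("foxes" in model.wv.key_to_index)  # False: never seen, so no vector

try:
    model.wv["foxes"]                     # unseen word
except KeyError:
    print("No vector available for 'foxes'")
```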
In certain languages, sub-words, that is, the internal structure and composition of words, carry important morphological information:
Can we capture this information?
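As an intuition for what sub-word information looks like, fastText treats each word as a bag of character n-grams (3 to 6 characters by default), padded with the boundary markers < and >. The helper below is a hypothetical illustration of that decomposition, not fastText's actual implementation; it shows how morphologically related words end up sharing many sub-words.

```python
# Hypothetical helper illustrating character n-gram decomposition,
# using fastText's default n-gram range of 3 to 6 characters
def char_ngrams(word, min_n=3, max_n=6):
    wrapped = f"<{word}>"            # boundary markers, as fastText uses
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    return ngrams

# Related word forms share sub-words, which is where the
# morphological information comes from
print(sorted(char_ngrams("jumping", 3, 3)))
print(char_ngrams("jumping") & char_ngrams("jumps"))
```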
To answer the preceding questions: yes, we can, and we will use fastText to capture the information contained in sub-words:
What is fastText and how does it work?
Bojanowski et al., researchers from Facebook, built on top of the Word2Vec Skip-gram model developed by Mikolov et al., which we discussed in Chapter 5, Word Embeddings and Distance Measurements...