One way to overcome the curse of dimensionality is by learning a lower-dimensional, distributed representation of the words (A Neural Probabilistic Language Model, http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf). This distributed representation is created by learning an embedding function that transforms the space of words into a lower-dimensional space of word embeddings as follows:
Words from a vocabulary of size V are first transformed into one-hot encoded vectors of size V (each word is encoded uniquely). The embedding function then maps this V-dimensional space into a distributed representation of size D (here, D=4).
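To make this concrete, here is a minimal sketch of that transformation, assuming a toy vocabulary and a randomly initialized embedding matrix (in a real model, the matrix would be learned during training). It shows how multiplying a one-hot vector of size V by a V×D matrix simply selects the corresponding D-dimensional embedding:

```python
import numpy as np

# Toy vocabulary (assumed for illustration only)
vocabulary = ["cat", "dog", "sat", "on", "mat"]
V = len(vocabulary)  # vocabulary size
D = 4                # embedding size, as in the example above

word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Encode a word as a V-dimensional vector with a single 1."""
    vec = np.zeros(V)
    vec[word_to_index[word]] = 1.0
    return vec

# The embedding function can be represented as a V x D matrix;
# here it is random, whereas a trained model learns its values.
embedding_matrix = np.random.randn(V, D)

# Multiplying the one-hot vector by the matrix picks out one row:
# the D-dimensional embedding of the word.
embedding = one_hot("cat") @ embedding_matrix
print(embedding)  # a 4-dimensional vector

assert np.allclose(embedding, embedding_matrix[word_to_index["cat"]])
```

In practice, frameworks implement this as a direct row lookup into the embedding matrix rather than an explicit matrix multiplication, but the two are equivalent.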
The idea is that the embedding function learns semantic information about the words. It associates each word in the vocabulary with a continuous-valued vector representation, that is, the word embedding.