The original skip-gram algorithm
The skip-gram algorithm discussed so far in this book is actually an improvement over the original skip-gram algorithm proposed by Mikolov and others in 2013. Unlike the version presented here, the original algorithm did not use an intermediate hidden layer to learn the representations. Instead, it used two different embedding (or projection) layers (the input and output embeddings in Figure 4.1) and defined a cost function derived from the embeddings themselves:
The original negative sampled loss was defined as follows:

$$J(v, v') = -\log \sigma\left(v'^{\top}_{w_j} v_{w_i}\right) - \sum_{q=1}^{k} \mathbb{E}_{w_q \sim P_n(w)}\left[\log \sigma\left(-v'^{\top}_{w_q} v_{w_i}\right)\right]$$

Here, $v$ is the input embeddings layer, $v'$ is the output word embeddings layer, $v_{w_i}$ corresponds to the embedding vector for the input word $w_i$ in the input embeddings layer, and $v'_{w_j}$ corresponds to the word vector for the output (context) word $w_j$ in the output embeddings layer. $P_n(w)$ is the noise distribution, from which we sample $k$ noise samples (for example, it can be as...
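To make the structure of this loss concrete, here is a minimal NumPy sketch that evaluates the negative sampled loss for a single (input word, context word) pair. The array names (`V_in`, `V_out`) and the uniform noise distribution are illustrative assumptions, not the paper's exact setup; the original work uses a smoothed unigram distribution as $P_n(w)$.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, k = 10, 4, 3
# Two separate embedding layers, as in the original algorithm:
V_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # input embeddings v
V_out = rng.normal(scale=0.1, size=(vocab_size, embed_dim))  # output embeddings v'

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(target, context, noise_ids):
    """Negative sampled loss for one (w_i, w_j) pair with k noise words."""
    v_wi = V_in[target]    # v_{w_i}: input embedding of the target word
    v_wj = V_out[context]  # v'_{w_j}: output embedding of the true context word
    pos = np.log(sigmoid(v_wj @ v_wi))
    # Monte Carlo estimate of the expectation over P_n(w):
    neg = np.sum(np.log(sigmoid(-V_out[noise_ids] @ v_wi)))
    return -(pos + neg)

# Noise words drawn from P_n(w); a uniform draw here for simplicity.
noise = rng.choice(vocab_size, size=k)
loss = neg_sampling_loss(target=2, context=5, noise_ids=noise)
```

Because $\sigma(\cdot) < 1$, each log term is negative and the negated sum is always a positive loss; training lowers it by pushing the true pair's score up and the noise pairs' scores down.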