Comparing skip-gram with CBOW
Before looking at the performance differences and investigating the reasons behind them, let's remind ourselves of the fundamental difference between the skip-gram and CBOW methods.
As shown in the following figures, given a context and a target word, skip-gram observes only the target word and a single word of the context in a single input/output tuple, whereas CBOW observes the target word and all the words in the context in a single sample. For example, take the phrase "dog barked at the mailman": at a single time step, skip-gram sees an input-output tuple such as ["dog", "at"], whereas CBOW sees the input-output tuple [["dog", "barked", "the", "mailman"], "at"]. Therefore, in a given batch of data, CBOW receives more information than skip-gram about the context of a given word. Next, let's see how this difference affects the performance of the two algorithms.
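To make this concrete, here is a minimal sketch (not the exact data pipeline used elsewhere in this chapter) of how each method slices a tokenized phrase into training samples. The window size and the helper names are assumptions chosen for illustration; the tuples follow the (context, target) ordering used in the example above.

```python
def skip_gram_samples(tokens, window=2):
    """Return (context_word, target_word) tuples, one context word per sample."""
    samples = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                samples.append((tokens[j], target))
    return samples

def cbow_samples(tokens, window=2):
    """Return ([all context words], target_word) tuples, one per target position."""
    samples = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        samples.append((context, target))
    return samples

tokens = ["dog", "barked", "at", "the", "mailman"]
print(skip_gram_samples(tokens))  # includes ('dog', 'at'), ('barked', 'at'), ...
print(cbow_samples(tokens))       # includes (['dog', 'barked', 'the', 'mailman'], 'at')
```

For the target word "at", skip-gram produces several separate single-word tuples, while CBOW produces one sample that bundles the entire context, which is exactly the difference in information per sample discussed here.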
As shown in the preceding figures, the CBOW model has access to more information (inputs) at a given time compared to the skip-gram model.