So, now that we can create vectors that encode the meaning of words, and given that any movie review is, after tokenization, an array of N words, we can begin creating a poor man's doc2vec by taking the average of the vectors of all the words that make up the review. Note that by averaging the individual word vectors, we lose the specific ordering of the words, which, depending on the sensitivity of your application, can make a difference:
doc_vector = (v(word_1) + v(word_2) + ... + v(word_N)) / N
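To make the averaging step concrete, here is a minimal sketch in Scala. The function name averageWordVectors, the Map[String, Array[Double]] representation of the learned embeddings, and the decision to skip out-of-vocabulary words are our illustrative assumptions, not part of the original text:

```scala
// Poor man's doc2vec: average the vectors of the words in one review.
// `wordVectors` is assumed to map each in-vocabulary word to its embedding.
def averageWordVectors(tokens: Seq[String],
                       wordVectors: Map[String, Array[Double]],
                       vectorSize: Int): Array[Double] = {
  // Keep only the words we actually have vectors for
  val present = tokens.flatMap(wordVectors.get)
  val sum = Array.fill(vectorSize)(0.0)
  for (v <- present; i <- 0 until vectorSize) sum(i) += v(i)
  // Divide the component-wise sum by the number of contributing words
  if (present.nonEmpty) sum.map(_ / present.size) else sum
}

// Example: "plot" is out of vocabulary here, so only two vectors are averaged.
val vecs = Map("great" -> Array(0.9, 0.1), "movie" -> Array(0.4, 0.6))
averageWordVectors(Seq("great", "movie", "plot"), vecs, 2)
// => Array(0.65, 0.35)
```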
Ideally, one would use a flavor of doc2vec to create document vectors; however, at the time of writing this book, doc2vec has yet to be implemented in MLlib, so for now, we are going to use this simple averaging scheme, which, as you will see, produces surprising results. Fortunately, we do not have to implement the averaging ourselves: the Spark ML implementation of the word2vec model already averages the word vectors of each document when it transforms it into a single vector.
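The following is a minimal sketch of that behavior, assuming a local SparkSession; the column names (review, features) and the parameter values are illustrative choices, not taken from the original text:

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("poor-mans-doc2vec")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two toy "reviews", already tokenized into arrays of words
val reviews = Seq(
  Seq("this", "movie", "was", "great"),
  Seq("a", "dull", "and", "predictable", "plot")
).toDF("review")

val word2Vec = new Word2Vec()
  .setInputCol("review")    // tokenized text
  .setOutputCol("features") // one vector per review
  .setVectorSize(100)
  .setMinCount(0)           // keep every word in this tiny example

val model = word2Vec.fit(reviews)

// transform() emits, for each review, the average of its word vectors
model.transform(reviews).show(truncate = false)
```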