Now that you have a good understanding of word2vec, doc2vec, and the incredible power of word vectors, it's time to turn our focus back to the original IMDB dataset, on which we will perform the following preprocessing steps:
- Split words in each movie review by a space
- Remove punctuation
- Remove stopwords and all alphanumeric words (tokens that mix letters and digits)
- Apply our tokenization function from the previous chapter, which yields an array of comma-separated words
Because we have already covered the preceding steps in Chapter 4, Predicting Movie Reviews Using NLP and Spark Streaming, we'll quickly reproduce them in this section.
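As a quick refresher, the steps listed above can be sketched in plain Scala, without Spark. This is a minimal illustration rather than the tokenization function from Chapter 4: the name tokenize, the tiny stopWords set, and the sample review are all hypothetical, and dropping every token that contains a non-letter character is only one possible reading of "alphanumeric words".

```scala
// Hypothetical stopword list for illustration only; a real list is much larger.
val stopWords: Set[String] =
  Set("the", "a", "an", "and", "is", "it", "this", "of", "was")

// Hypothetical helper (not the Chapter 4 function): turns one review into tokens.
def tokenize(review: String): Array[String] =
  review
    .split(" ")                           // split the review on spaces
    .map(_.replaceAll("\\p{Punct}", ""))  // strip ASCII punctuation
    .map(_.toLowerCase)                   // normalize case before stopword lookup
    .filter(_.nonEmpty)                   // drop tokens that were only punctuation
    .filter(_.forall(_.isLetter))         // assumption: drop tokens containing digits
    .filter(w => !stopWords.contains(w))  // remove stopwords

// Made-up sample review, not taken from the IMDB dataset.
val tokens = tokenize("This movie was great, 10/10 and the acting is superb!")
println(tokens.mkString(","))
```

In the Spark shell, the same kind of function would typically be applied to every review in the dataset, for example by mapping it over the reviews or wrapping it in a user-defined function.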
As usual, we begin by starting the Spark shell, which is our working environment:
export SPARKLING_WATER_VERSION="2.1.12"
export SPARK_PACKAGES=\
"ai.h2o:sparkling-water-core_2.11:${SPARKLING_WATER_VERSION...