Twitter and the Godwin point
With our text content properly cleaned up, we can feed it into a Word2Vec algorithm and attempt to understand words in their actual context.
Learning context
As it says on the tin, the Word2Vec algorithm transforms a word into a vector. The idea is that words used in similar contexts will be mapped to nearby points in the vector space, so contextual similarity translates into geometric proximity.
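Closeness between two word vectors is typically measured with cosine similarity. As a minimal sketch of the idea (the three-dimensional vectors below are made-up toy values for illustration, not real embeddings):

```scala
// Cosine similarity between two word vectors: dot(a, b) / (|a| * |b|).
// Values close to 1.0 indicate contextually similar words.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

// Toy "embeddings" for illustration only.
val football = Array(0.9, 0.1, 0.2)
val soccer   = Array(0.8, 0.2, 0.3)
val banana   = Array(0.1, 0.9, 0.7)

cosineSimilarity(football, soccer) // high: similar contexts
cosineSimilarity(football, banana) // lower: dissimilar contexts
```

In a trained model, words appearing in similar tweets would end up with vectors whose cosine similarity is high, which is exactly what the algorithm optimizes for.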
Note
More information about the Word2Vec algorithm can be found at https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
Well integrated into Spark, a Word2Vec model can be trained as follows:
import org.apache.spark.mllib.feature.Word2Vec

val corpusRDD = tweetRDD
  .map(_.body.split("\\s").toSeq)
  .filter(_.distinct.length >= 4)

val model = new Word2Vec().fit(corpusRDD)
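Once trained, the model can be queried for contextually similar words via MLlib's findSynonyms method, which ranks the vocabulary by vector similarity to the query word (the word "paris" below is purely illustrative; it depends on what actually appears in the tweet corpus):

```scala
// Retrieve the 5 words whose vectors are closest to the query word's
// vector. Returns (word, similarity) pairs, highest similarity first.
val synonyms: Array[(String, Double)] = model.findSynonyms("paris", 5)

synonyms.foreach { case (word, similarity) =>
  println(s"$word -> $similarity")
}
```

Note that findSynonyms throws an exception if the query word was not seen during training, so it is worth guarding calls on a real, noisy Twitter vocabulary.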
Here we extract each tweet as a sequence of words, only keeping records with at least four distinct words. Note that the list of all words...