Displaying similar words with Spark using Word2Vec
In this recipe, we will explore Word2Vec, which is Spark's tool for assessing word similarity. The Word2Vec algorithm is inspired by the distributional hypothesis in linguistics. At its core, the hypothesis states that tokens which occur in the same context (that is, within a given distance of the target word) tend to express the same underlying concept or meaning.
The Word2Vec algorithm was invented by a team of researchers at Google. Refer to the white paper mentioned in the There's more... section of this recipe, which describes Word2Vec in more detail.
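To make the notion of "context" more concrete, the following is a minimal sketch in plain Scala (no Spark required) that collects the words found within a fixed distance of each token. It only illustrates the distributional idea and is not how Spark's Word2Vec is implemented; the object name ContextWindows, its two helper methods, and the sample sentence are hypothetical and chosen purely for illustration.

```scala
// Illustrative sketch only: this is NOT Spark's Word2Vec implementation.
// ContextWindows, contextPairs, and contextsByToken are hypothetical names.
object ContextWindows {

  /** Pairs each token with every token at most `window` positions away from it. */
  def contextPairs(tokens: Seq[String], window: Int): Seq[(String, String)] = {
    val indexed = tokens.toIndexedSeq
    for {
      i <- indexed.indices
      j <- math.max(0, i - window) to math.min(indexed.length - 1, i + window)
      if j != i // a token is not part of its own context
    } yield (indexed(i), indexed(j))
  }

  /** Collects, for every distinct token, the set of context words seen around it. */
  def contextsByToken(tokens: Seq[String], window: Int): Map[String, Set[String]] =
    contextPairs(tokens, window)
      .groupBy(_._1)
      .map { case (target, pairs) => target -> pairs.map(_._2).toSet }

  def main(args: Array[String]): Unit = {
    // Made-up sample sentence, already lowercased and split on spaces.
    val sentence = "the cat drinks milk and the dog drinks water".split(" ").toSeq
    val contexts = contextsByToken(sentence, window = 1)

    // Print the context words collected for two different tokens.
    println(s"cat -> ${contexts("cat")}")
    println(s"dog -> ${contexts("dog")}")
  }
}
```

With a window of 1, both "cat" and "dog" end up with the same context words ("the" and "drinks"). Shared context of this kind is what the distributional hypothesis treats as evidence of related meaning. Word2Vec itself goes further and learns dense vectors from such contexts rather than comparing raw sets.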
How to do it...
- Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
- The package statement for the recipe is as follows:
package spark.ml.cookbook.chapter12
- Import the necessary packages for Scala and Spark:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, Word2Vec}
import org.apache...