In this section, we look at how to use the DataFrame-based Spark ML API and the newer implementations in Spark 2.0.x to create a Word2Vec model.
We first create a DataFrame from the dataset:
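import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Local-mode Spark configuration.
val spConfig = new SparkConf().setMaster("local").setAppName("SparkApp")
val spark = SparkSession
  .builder
  .appName("Word2Vec Sample")
  .config(spConfig)
  .getOrCreate()
import spark.implicits._

// wholeTextFiles returns an RDD of (file path, file content) pairs.
val rawDF = spark.sparkContext
  .wholeTextFiles("./data/20news-bydate-train/alt.atheism/*")
// Keep only the file content, dropping control characters and '(' characters.
val temp = rawDF.map(x => x._2.filter(_ >= ' ').filter(!_.toString.startsWith("(")))
// Split each document on spaces and wrap it in a Tuple1 so that toDF can name the column.
val textDF = temp.map(x => x.split(" ")).map(Tuple1.apply).toDF("text")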
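To make the effect of the two character filters and the split easier to see, here is a minimal sketch that applies the same operations to a plain string, without Spark. The object name PreprocessSketch, the helper cleanAndTokenize, and the sample text are hypothetical and used only for illustration; they are not part of the original example.

```scala
object PreprocessSketch {
  // Hypothetical helper mirroring the per-document steps used to build textDF.
  def cleanAndTokenize(text: String): Array[String] = {
    val cleaned = text
      .filter(_ >= ' ')                          // drops control characters such as '\n' and '\t'
      .filter(c => !c.toString.startsWith("("))  // drops every '(' character
    cleaned.split(" ")                           // tokenizes on single spaces
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample text, not taken from the 20 Newsgroups data.
    val sample = "Hello (world)\nthis is\ta test"
    println(cleanAndTokenize(sample).mkString("[", ", ", "]"))
    // prints: [Hello, world)this, isa, test]
  }
}
```

Because newlines and tabs are removed rather than replaced with spaces, the last word on one line is fused with the first word on the next (world)this and isa in the output). If that is not desired, one option is to replace control characters with a space before splitting.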
Next, we instantiate the Word2Vec class and train the model on the textDF DataFrame created above:
val word2Vec = new Word2Vec...