In this section, we use the libsvm version of the 20newsgroup data with the Spark DataFrame-based APIs to classify text documents. The current version of Spark supports libsvm version 3.22 (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
Download the libsvm-formatted 20newsgroup data from the following link and copy the output folder under Spark-2.0.x:
https://1drv.ms/f/s!Av6fk5nQi2j-iF84quUlDnJc6G6D
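For orientation, each line of a libsvm-formatted file holds a label followed by sparse index:value pairs. The following is a minimal plain-Scala sketch of how one such line breaks down. The LabeledRow case class, the parseLibSvmLine helper, and the sample line are hypothetical and used only for illustration; they are not part of Spark or of the 20newsgroup data. Spark ships its own libsvm data source (spark.read.format("libsvm")) that handles this parsing when loading a DataFrame.

```scala
// Hypothetical container for illustration; not a Spark class.
final case class LabeledRow(label: Double, features: Map[Int, Double])

object LibSvmLineDemo {
  // Parses "label idx:val idx:val ..." into a label and a sparse feature map.
  // Assumes a well-formed, non-empty line; no error handling is attempted.
  def parseLibSvmLine(line: String): LabeledRow = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.map { tok =>
      // Each feature token has the form "index:value".
      val Array(idx, value) = tok.split(":", 2)
      idx.toInt -> value.toDouble
    }.toMap
    LabeledRow(label, features)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample line, not taken from the 20newsgroup data.
    val row = parseLibSvmLine("3 1:0.5 7:2.0 42:1.0")
    println(row.label)
    println(row.features.toSeq.sortBy(_._1).mkString(" "))
  }
}
```

Only the features that are present are listed on each line, which is why the format suits sparse text data such as term counts.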
Import the appropriate packages from org.apache.spark.ml and create the Scala wrapper object:
package org.apache.spark.examples.ml
import org.apache.spark.SparkConf
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession
object DocumentClassificationLibSVM {
def main(args: Array[String]): Unit = {
...