Latent Dirichlet Allocation (LDA) is a topic model that infers topics from a collection of text documents. LDA can be thought of as an unsupervised clustering algorithm, as follows:
- Topics correspond to cluster centers and documents correspond to rows in a dataset
- Topics and documents both exist in a feature space, where feature vectors are vectors of word counts
- Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated
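To make the "vectors of word counts" feature space concrete, here is a minimal sketch in plain Scala, without Spark. The two toy documents and the helper name countVector are assumptions made up for illustration; they are not part of the Spark API.

```scala
// Two toy documents, already tokenized (hypothetical data for illustration).
val docs = Seq(
  Seq("spark", "lda", "topic", "spark"),
  Seq("topic", "model", "text")
)

// Vocabulary: every distinct word, sorted so each word gets a stable index.
val vocab: Seq[String] = docs.flatten.distinct.sorted

// Count vector for one document: entry i is how often vocab(i) occurs in it.
// countVector is a hypothetical helper, not a Spark function.
def countVector(doc: Seq[String]): Seq[Int] = {
  val counts = doc.groupBy(identity).map { case (word, hits) => word -> hits.size }
  vocab.map(word => counts.getOrElse(word, 0))
}

// One row per document, one column per vocabulary word.
val vectors = docs.map(countVector)
vectors.foreach(v => println(v.mkString("[", ", ", "]")))
```

Each document becomes one row whose length equals the vocabulary size. A topic lives in the same space: it is a weighting over the same vocabulary, which is what lets topics play the role of cluster centers.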
To use LDA, import the LDA class:
import org.apache.spark.ml.clustering.LDA
Step 1. First, create an LDA estimator configured for 10 topics (k = 10) and a maximum of 10 iterations:
scala> val lda = new LDA().setK(10).setMaxIter(10)
lda: org.apache.spark.ml.clustering.LDA = lda_18f248b08480
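Before fitting, it can help to read the settings back from the estimator. The sketch below assumes only that Spark ML is on the classpath; building the estimator and reading its parameters does not require a running SparkSession.

```scala
import org.apache.spark.ml.clustering.LDA

// Same configuration as in Step 1.
val lda = new LDA().setK(10).setMaxIter(10)

// Read the settings back to confirm what fit() will use.
println(s"k = ${lda.getK}, maxIter = ${lda.getMaxIter}")

// Name of the column fit() reads the word-count vectors from.
// It defaults to "features" unless changed with setFeaturesCol.
println(s"featuresCol = ${lda.getFeaturesCol}")
```

The input dataset passed to fit() therefore needs a vector column with that name, or the estimator must be pointed at a different column with setFeaturesCol.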
Step 2. Next, invoke the fit() function on the input dataset...