CountVectorizer converts a collection of text documents into vectors of token counts, producing sparse representations of the documents over the vocabulary. The result is a vector of features that can then be passed to other algorithms. Later, we will see how to feed the output of the CountVectorizer into the LDA algorithm to perform topic detection.
In order to invoke CountVectorizer, you need to import the package:
import org.apache.spark.ml.feature.CountVectorizer
First, you need to initialize a CountVectorizer Estimator, specifying the input column and the output column. Here, we choose the filteredWords column created by the StopWordsRemover and generate the output column features:
scala> val countVectorizer = new CountVectorizer().setInputCol("filteredWords").setOutputCol("features")
countVectorizer: org.apache...
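To make the flow concrete, here is a hedged sketch of fitting and applying the CountVectorizer. The sample DataFrame is hypothetical (a stand-in for the output of StopWordsRemover), and a SparkSession named `spark` is assumed to be available, as in the shell:

```scala
import org.apache.spark.ml.feature.CountVectorizer

// Hypothetical sample data standing in for the StopWordsRemover output;
// each row holds a document's filtered tokens
val df = spark.createDataFrame(Seq(
  (0, Array("spark", "makes", "distributed", "processing", "simple")),
  (1, Array("spark", "supports", "topic", "detection"))
)).toDF("id", "filteredWords")

val countVectorizer = new CountVectorizer()
  .setInputCol("filteredWords")
  .setOutputCol("features")

// CountVectorizer is an Estimator: fit() scans the corpus to learn the
// vocabulary and returns a CountVectorizerModel (a Transformer)
val model = countVectorizer.fit(df)

// transform() maps each document to a sparse vector of token counts
// over the learned vocabulary, stored in the features column
model.transform(df).select("features").show(false)

// The learned vocabulary is exposed on the fitted model
println(model.vocabulary.mkString(", "))
```

Note the two-step pattern: the Estimator's fit() learns the vocabulary from the corpus, and only the resulting model can transform documents into count vectors. This features column is exactly what we will pass to LDA later.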