The Tokenizer converts the input string to lowercase and then splits it on whitespace into individual tokens. A given sentence is split into words either with the default whitespace delimiter (Tokenizer) or with a custom regular-expression-based tokenizer (RegexTokenizer). In either case, an input column is transformed into an output column: the input column is typically a String and the output column is a sequence of words.
Both tokenizers are available by importing the two classes shown next, Tokenizer and RegexTokenizer:
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.RegexTokenizer
First, you need to initialize a Tokenizer, specifying the input column and the output column:
scala> val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_942c8332b9d8...
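With the tokenizer initialized, the next step is to apply it to a DataFrame of sentences with transform. The following is a minimal sketch; the sample sentences and the spark session (provided automatically by spark-shell) are assumptions for illustration, not part of the original text:

scala> val sentences = spark.createDataFrame(Seq(
     |   (0, "Hi I heard about Spark"),
     |   (1, "I wish Java could use case classes")
     | )).toDF("id", "sentence")

scala> // transform adds the "words" output column holding the tokens
scala> val tokenized = tokenizer.transform(sentences)

scala> tokenized.select("sentence", "words").show(false)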
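If you need more control over how sentences are split, a RegexTokenizer is initialized the same way, with an additional pattern parameter. A minimal sketch follows; the pattern "\\W" (split on non-word characters) is an illustrative choice, not one from the original text:

scala> val regexTokenizer = new RegexTokenizer()
     |   .setInputCol("sentence")
     |   .setOutputCol("words")
     |   .setPattern("\\W")  // split on any non-word character

By default, the pattern is treated as a delimiter that splits the text (gaps = true); calling setGaps(false) makes the pattern match the tokens themselves instead.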