Although the spark.ml package builds ML workflows on DataFrames, the data usually needs some preparation first. Depending on the use case, one might need to extract features from a raw DataFrame, transform the DataFrame into the format an ML algorithm requires, or select only a few parameters to use as feature vectors. Each of these operations has its own dedicated APIs, which can be grouped into the following categories.
Operations on feature vectors
Feature extractors
When the features an ML algorithm expects are not explicitly present in a raw DataFrame, we use feature extractors to derive them from the raw data. Common feature extractors are:
- CountVectorizer: A CountVectorizer converts a collection of text documents...
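To make the idea behind count vectorization concrete, here is a minimal plain-Python sketch. It is not Spark's implementation: the helper names `fit_vocabulary` and `transform` are hypothetical, the alphabetical tie-breaking is an assumption made for determinism, and the output is a dense list where Spark would produce a sparse vector. It only illustrates the two steps involved: learn a vocabulary from tokenized documents, then map each document to a vector of token counts.

```python
from collections import Counter


def fit_vocabulary(documents, vocab_size=None):
    """Build a term -> index mapping, most frequent terms first.

    Hypothetical helper for illustration; ties are broken alphabetically,
    which is an assumption of this sketch.
    """
    totals = Counter()
    for tokens in documents:
        totals.update(tokens)
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    # vocab_size=None keeps every term; otherwise keep only the top terms.
    terms = [term for term, _ in ordered[:vocab_size]]
    return {term: index for index, term in enumerate(terms)}


def transform(documents, vocabulary):
    """Turn each tokenized document into a dense vector of token counts."""
    vectors = []
    for tokens in documents:
        # Terms outside the vocabulary are ignored.
        counts = Counter(t for t in tokens if t in vocabulary)
        vector = [0] * len(vocabulary)
        for term, count in counts.items():
            vector[vocabulary[term]] = count
        vectors.append(vector)
    return vectors


# Two tiny tokenized "documents" (made-up sample data).
docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
vocab = fit_vocabulary(docs)
vectors = transform(docs, vocab)
print(vocab)
print(vectors)
```

In Spark itself, the same fit-then-transform pattern applies: `CountVectorizer` is fitted on a column of token arrays and the resulting model adds a column of count vectors, with parameters such as `vocabSize` and `minDF` controlling which terms enter the vocabulary.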