Overview of the package
At the top level, the package exposes three main abstract classes: a Transformer, an Estimator, and a Pipeline. We will explain each in turn with short examples, and provide more concrete examples of some of the models in the last section of this chapter.
Transformer
The Transformer class, as the name suggests, transforms your data by (normally) appending a new column to your DataFrame.
At a high level, every new class that derives from the Transformer abstract class needs to implement a .transform(...) method. The first, and normally the only obligatory, parameter of this method is the DataFrame to be transformed. The remaining parameters vary from method to method in the ML package: other popular parameters are inputCol and outputCol; these, however, frequently default to some predefined value, such as 'features' for the inputCol parameter.
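The contract described above can be illustrated with a minimal plain-Python sketch. It is a toy stand-in and not the PySpark API: the names ToyTransformer and ToyLength, the list-of-dicts "DataFrame", and the default value 'output' for outputCol are all assumptions made for illustration. It only shows the shape of the idea: an abstract base class that forces subclasses to implement transform, and a subclass that returns the data with one new column appended.

```python
from abc import ABC, abstractmethod


class ToyTransformer(ABC):
    """Hypothetical stand-in for the Transformer contract (not the PySpark class)."""

    def __init__(self, inputCol="features", outputCol="output"):
        # Column names default to predefined values, as described in the text;
        # the 'output' default is an assumption made for this sketch.
        self.inputCol = inputCol
        self.outputCol = outputCol

    @abstractmethod
    def transform(self, dataset):
        """Return a new dataset with the output column appended."""


class ToyLength(ToyTransformer):
    """Hypothetical transformer: appends the length of the input column."""

    def transform(self, dataset):
        # Build new rows rather than mutating the input, so the original
        # dataset is left unchanged.
        return [
            {**row, self.outputCol: len(row[self.inputCol])}
            for row in dataset
        ]


# A list of dicts plays the role of a DataFrame in this sketch.
rows = [{"features": [1.0, 2.0]}, {"features": [3.0]}]
result = ToyLength(outputCol="n_features").transform(rows)
print(result)
```

The only required argument to transform is the dataset itself; the column names are set when the object is constructed and fall back to defaults when omitted. Trying to instantiate ToyTransformer directly raises a TypeError, because the abstract transform method has not been implemented.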
There are many Transformers offered in the spark.ml.feature module and we will briefly describe...