The final thing
As we mentioned earlier, one of the interesting additions in Spark 2.0.0 is the ML pipeline. A pipeline is nothing but a linear sequence of transformers and estimators. If we look at the classes we have been using, each of them is either a transformer or an estimator. We had a decent pipeline for our classification example, as follows:
- We started with Passengers, which was the Dataset that we read in.
- Passengers1 was after the feature extraction.
- Passengers2 was after StringIndexer.
- Passengers3 was after the na.drop() function.
- Passengers4 was after the VectorAssembler() function.
- The algTree object was the algorithm object (a sketch of these stages follows this list).
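Just as a point of reference, here is a minimal sketch of what those three stages might look like, assuming Titanic-style column names such as Sex, Pclass, Age, Fare, and Survived; the actual column names come from the feature-extraction step above:

import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.classification.DecisionTreeClassifier

// Index the (assumed) string column "Sex" into a numeric column
val indexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex")

// Assemble the (assumed) numeric columns into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("Pclass", "SexIndex", "Age", "Fare"))
  .setOutputCol("features")

// The decision-tree estimator; "Survived" is the assumed label column
val algTree = new DecisionTreeClassifier()
  .setLabelCol("Survived")
  .setFeaturesCol("features")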
We would have created a pipeline:
val treePipeline = new Pipeline().setStages(Array(indexer, assembler, algTree))
Then, we would have created a model:
val mdlTree = treePipeline.fit(trainData)
Finally, we would have predicted as usual:
val predictions = mdlTree.transform(testData)
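To inspect the result we could, as before, peek at a few rows; "prediction" and "probability" are the default output columns of the ML classifiers, while "Survived" is an assumed label column name:

// Show a handful of predictions next to the (assumed) label column
predictions.select("Survived", "prediction", "probability").show(5)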
Of course, our original sequence won't work as-is. Since na.drop() is not one of the pipeline stages, we have to call na.drop() on Passengers1 and...
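A minimal sketch of how that adjusted sequence might look, assuming the null rows are dropped right after feature extraction and that the data is then split for training and testing (the 80/20 ratio and the seed are illustrative assumptions; the other names follow the snippets above):

import org.apache.spark.ml.Pipeline

// passengers1 is the feature-extracted Dataset (Passengers1 above);
// drop rows with nulls up front, since na.drop() is not a pipeline stage
val passengersClean = passengers1.na.drop()

// Illustrative 80/20 split on the cleaned Dataset
val Array(trainData, testData) = passengersClean.randomSplit(Array(0.8, 0.2), seed = 4711L)

// The pipeline itself stays exactly as described above
val treePipeline = new Pipeline().setStages(Array(indexer, assembler, algTree))
val mdlTree = treePipeline.fit(trainData)
val predictions = mdlTree.transform(testData)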