The final thing
As we mentioned earlier, one of the interesting additions to Spark 2.0.0 is the ML pipeline. A pipeline is nothing but a linear graph of transformers and estimators. If we look at the classes we have been using, each one is either a transformer or an estimator. We already had a decent pipeline in our classification example, as follows:
We started with Passengers, which was the Dataset that we read in.
Passengers1 was after the feature extraction.
Passengers2 was after StringIndexer.
Passengers3 was after the na.drop() function.
Passengers4 was after the VectorAssembler() function.
The algTree object was the algorithm object.
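The sequence above can be sketched step by step as follows. This is only a minimal sketch, not the original code: the rows are made up, and the column names (Name, Survived, Pclass, Sex, Age), the output column SexIdx, and the local SparkSession are assumptions for illustration. The Dataset names are written in lowercase, as is usual for Scala values. It is written so that it can be pasted into spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Local session for the sketch; inside spark-shell this returns the existing session
val spark = SparkSession.builder().master("local[*]").appName("ManualStagesSketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the Dataset that was read in (made-up rows)
val passengers = Seq[(String, Int, Int, String, Option[Double])](
  ("A", 1, 1, "female", Some(29.0)),
  ("B", 0, 3, "male", Some(35.0)),
  ("C", 1, 2, "female", None),
  ("D", 0, 3, "male", Some(54.0))).toDF("Name", "Survived", "Pclass", "Sex", "Age")

// Feature extraction: keep the columns of interest and cast the label to double
val passengers1 = passengers.select($"Survived".cast("double").as("Survived"), $"Pclass", $"Sex", $"Age")

// StringIndexer is an estimator: fit() returns a model, which then transforms the data
val indexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIdx")
val passengers2 = indexer.fit(passengers1).transform(passengers1)

// Drop every row that contains a null (here, the row with the missing Age)
val passengers3 = passengers2.na.drop()

// VectorAssembler is a transformer: it packs the numeric columns into one vector column
val assembler = new VectorAssembler().setInputCols(Array("Pclass", "SexIdx", "Age")).setOutputCol("features")
val passengers4 = assembler.transform(passengers3)

passengers4.show()
```

Each intermediate Dataset exists only to feed the next step, which is exactly the bookkeeping a pipeline takes over.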
We would have created a pipeline:
val treePipeline = new Pipeline().setStages(Array(indexer, assembler, algTree))
Then, we would have created a model:
val mdlTree = treePipeline.fit(trainData)
Finally, we would have predicted as usual:
val predictions = mdlTree.transform(testData)
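Put together, the three statements above might look like the following end-to-end sketch. It is written under several assumptions that are not in the original: the rows are made up, the column names (Id, Survived, Pclass, Sex, Age) and SexIdx are hypothetical, the train/test split by Id is only there to keep the tiny example deterministic, and a reasonably recent Spark version is assumed. Because na.drop() is a Dataset method rather than a pipeline stage, the sketch applies it before the data reaches the pipeline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Local session for the sketch; inside spark-shell this returns the existing session
val spark = SparkSession.builder().master("local[*]").appName("TreePipelineSketch").getOrCreate()
import spark.implicits._

// Hypothetical, made-up rows standing in for the data after feature extraction
val passengers1 = Seq[(Int, Double, Int, String, Option[Double])](
  (1, 1.0, 1, "female", Some(29.0)),
  (2, 0.0, 3, "male", Some(35.0)),
  (3, 1.0, 2, "female", None),
  (4, 0.0, 3, "male", Some(54.0)),
  (5, 1.0, 1, "female", Some(41.0)),
  (6, 0.0, 2, "male", Some(22.0)),
  (7, 1.0, 2, "female", Some(33.0)),
  (8, 0.0, 3, "male", Some(47.0))).toDF("Id", "Survived", "Pclass", "Sex", "Age")

// Remove rows with nulls up front, since na.drop() cannot be a pipeline stage
val cleaned = passengers1.na.drop()

// Deterministic split for illustration only; both parts contain both Sex values
val trainData = cleaned.filter($"Id" <= 6)
val testData = cleaned.filter($"Id" > 6)

// The three stages: an estimator, a transformer, and the algorithm (an estimator)
val indexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIdx")
val assembler = new VectorAssembler().setInputCols(Array("Pclass", "SexIdx", "Age")).setOutputCol("features")
val algTree = new DecisionTreeClassifier().setLabelCol("Survived").setFeaturesCol("features")

// fit() runs the stages in order on the training data and returns a PipelineModel
val treePipeline = new Pipeline().setStages(Array(indexer, assembler, algTree))
val mdlTree = treePipeline.fit(trainData)

// transform() replays the fitted stages on unseen data and appends a prediction column
val predictions = mdlTree.transform(testData)
predictions.select("Id", "Survived", "prediction").show()
```

Note that the indexer is fitted only on the training data here, so a category that appears only in the test data would make transform() fail under StringIndexer's default settings.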
Of course, our original sequence won't work. We have to do na.drop() on Passengers1...