Summary
You've learned that, as in many other places, the introduction of DataFrames has led to the development of complementary frameworks that no longer use RDDs directly. This is also the case for machine learning, but there is much more to it. Pipelines actually take machine learning in Apache Spark to the next level, as they dramatically improve the productivity of the data scientist.
All intermediate objects are compatible with one another, and the underlying concepts are well thought out. This framework makes it very easy to build your own stacked and bagged models with the full support of the underlying performance optimizations of Tungsten and Catalyst.
Great! Finally, we've applied the concepts that we discussed to a real dataset from a Kaggle competition, which is a very nice starting point for your own machine learning project with Apache SparkML. The next chapter covers Apache SystemML, which is a third-party machine learning library for Apache Spark. Let's see why it is useful and what...