Pipelining 2.0
In Chapter 4, Packaging Up, we discussed the benefits of writing our ML code as pipelines, and we implemented some basic ML pipelines using tools such as sklearn and Spark MLlib. The pipelines we built there were a convenient way of streamlining code, bundling several processing steps into a single object to simplify an application. However, everything we did was confined to a single Python file and could not easily be extended beyond the package we were using. With those techniques, for example, it would be very difficult to create pipelines in which each step used a different package, or in which the steps were entirely different programs. Nor did they allow us to build much sophistication into our data flows or application logic: if one step failed, the whole pipeline failed, and that was that.
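To make the limitation concrete, the following is a minimal, hypothetical sketch (plain Python, deliberately not sklearn's or MLlib's actual API) of the kind of single-process pipeline we had before: every step runs in the same interpreter, and there are no retries, no branching, and no recovery, so one failing step aborts the entire run.

```python
class SimplePipeline:
    """A toy, single-process pipeline: a chain of (name, callable) steps."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, callable) pairs

    def run(self, data):
        for name, step in self.steps:
            # No retry logic and no alternate paths: an exception
            # raised here kills the whole run immediately.
            data = step(data)
        return data


def clean(rows):
    # Drop missing values.
    return [r for r in rows if r is not None]


def failing_step(rows):
    # Stand-in for any step that errors out at runtime.
    raise RuntimeError("step failed")


pipeline = SimplePipeline([("clean", clean), ("broken", failing_step)])
try:
    pipeline.run([1, None, 2])
except RuntimeError:
    print("whole pipeline failed")  # no partial result is recoverable
```

The orchestration tools introduced next exist precisely to remove these constraints: each step can be its own program, and failures can trigger retries or alternate paths instead of aborting everything.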
The tools we are about to discuss take these...