Using Jug to break up your pipeline into tasks
Often, we have a simple pipeline: we preprocess the initial data, compute features, and then we need to call a machine learning algorithm with the resulting features.
Jug is a package developed by Luis Pedro Coelho, one of the authors of this book. It is open source (using the liberal MIT License) and can be useful in many areas but was designed specifically around data analysis problems. It simultaneously solves several problems, for example:
It can memoize results to disk (or a database), which means that if you ask it to compute something you have already computed before, the result is instead read from disk.
It can use multiple cores or even multiple computers on a cluster. Jug was also designed to work very well in batch computing environments that use a queuing system such as the Portable Batch System (PBS), the Load Sharing Facility (LSF), or the Oracle Grid Engine (OGE, earlier known as Sun Grid Engine). This will be used in the second half...
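To make the first point concrete, here is a minimal sketch of memoizing results to disk using only the Python standard library. It is not how Jug itself is implemented and it does not use Jug's interface; the names disk_memoize and compute_features are hypothetical and chosen only for this illustration, and the sketch assumes that the function arguments can be pickled.

```python
import hashlib
import os
import pickle
import tempfile


def disk_memoize(cache_dir):
    """Return a decorator that stores a function's results as pickle files.

    Hypothetical helper for illustration only; this is not Jug's API.
    """
    def decorator(func):
        def wrapper(*args):
            # Derive a file name from the function name and its arguments.
            key = pickle.dumps((func.__name__, args))
            name = hashlib.sha1(key).hexdigest() + ".pkl"
            path = os.path.join(cache_dir, name)
            if os.path.exists(path):
                # Computed before: read the stored result instead of recomputing.
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


cache_dir = tempfile.mkdtemp()
calls = []  # records each real computation, to show when the cache is hit


@disk_memoize(cache_dir)
def compute_features(n):
    # Stand-in for an expensive feature computation (hypothetical).
    calls.append(n)
    return [i * i for i in range(n)]


first = compute_features(5)   # computed, then written to disk
second = compute_features(5)  # read back from disk, not recomputed
```

After both calls, calls contains a single entry, because the second call finds the stored file and never runs the function body. A real tool has to handle more than this sketch does, such as invalidating stale results and coordinating several processes that share the same store.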