Pipeline overview
In Chapter 4, Installing Pachyderm Locally, and Chapter 5, Installing Pachyderm on a Cloud Platform, we learned how to deploy Pachyderm locally or on a cloud platform. By now, you should have some version of Pachyderm up and running, either on your computer or on a cloud platform. Now, let's create our first pipeline.
A Pachyderm pipeline is a component that processes data from one or more Pachyderm input repositories and writes the result to a Pachyderm output repository. Every time new data is uploaded to an input repository, the pipeline automatically processes it. Each new piece of data that lands in a repository is recorded as a commit with its own hash, so it can be accessed, rerun, or analyzed later. This makes the pipeline an essential component of the Pachyderm ecosystem, ensuring the reproducibility of your data science workloads.
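To make this concrete, here is a minimal sketch of what a Pachyderm pipeline specification looks like. A pipeline is declared in a JSON spec that names the pipeline, points at an input repository, and describes the container image and command that transform the data. The names used here (a pipeline called `contour`, an input repository called `photos`, and an image `myregistry/contour:1.0`) are hypothetical placeholders, not values from this chapter:

```json
{
  "pipeline": {
    "name": "contour"
  },
  "transform": {
    "image": "myregistry/contour:1.0",
    "cmd": ["python3", "/contour.py"]
  },
  "input": {
    "pfs": {
      "repo": "photos",
      "glob": "/*"
    }
  }
}
```

The `glob` pattern controls how Pachyderm splits the input repository into datums for processing; `/*` treats each top-level file as an independent datum, so each newly committed file is processed on its own.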
To get you started quickly, we have prepared a simple example of image processing that will draw a contour on an image. A contour is...