In this section, we set out our pipeline implementation objectives and document tangible results as we step through the individual implementation steps.
Before we implement the Iris pipeline, we want to understand what a pipeline is from both a conceptual and a practical perspective. We define a pipeline as a DataFrame processing workflow made up of multiple stages that execute in a defined sequence.
A DataFrame is a Spark abstraction that provides an API for working with collections of objects. At a high level, it represents a distributed collection of rows of data, much like a relational database table. Each value in a row falls under a named column; a Sepal-Width measurement, for example, falls under the column named Sepal-Width.
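To make the abstraction concrete, the following minimal Scala sketch builds a small Iris-style DataFrame. The column names, sample rows, and session settings here are illustrative assumptions, not the chapter's actual data-loading code:

```scala
import org.apache.spark.sql.SparkSession

// Start (or reuse) a local Spark session for experimentation.
val spark = SparkSession.builder()
  .appName("IrisDataFrameSketch")   // hypothetical application name
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Each tuple becomes a row; each value falls under a named column.
val irisDF = Seq(
  (5.1, 3.5, 1.4, 0.2, "setosa"),
  (6.7, 3.1, 4.7, 1.5, "versicolor")
).toDF("Sepal-Length", "Sepal-Width", "Petal-Length", "Petal-Width", "Species")

irisDF.printSchema()
irisDF.show()
```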
Each stage in a pipeline is an algorithm that is either a Transformer or an Estimator. As DataFrames flow through the pipeline, each stage is one of the following two types (a short sketch follows the list):
- Transformer stage: This applies a transformation action that turns one DataFrame into another DataFrame
- Estimator stage: This applies a training action (fit) on a DataFrame and produces a Model, which is itself a Transformer
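Continuing from the DataFrame built above, the sketch below illustrates the distinction. The choice of VectorAssembler as the Transformer and StringIndexer as the Estimator is ours for illustration, not mandated by the chapter:

```scala
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Transformer: VectorAssembler turns one DataFrame into another by
// appending a "features" vector column built from the measurement columns.
val assembler = new VectorAssembler()
  .setInputCols(Array("Sepal-Length", "Sepal-Width", "Petal-Length", "Petal-Width"))
  .setOutputCol("features")
val assembled = assembler.transform(irisDF)      // DataFrame in, DataFrame out

// Estimator: StringIndexer is trained with fit() on a DataFrame and returns
// a StringIndexerModel, which is itself a Transformer.
val indexer = new StringIndexer()
  .setInputCol("Species")
  .setOutputCol("label")
val indexerModel = indexer.fit(assembled)        // training action produces a model
val labeled = indexerModel.transform(assembled)  // the model transforms the data
```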
In summary, a pipeline is a single unit made up of stages, together with parameters and one or more DataFrames. The complete pipeline structure is listed as follows, with a short parameter sketch after the list:
- Transformer
- Estimator
- Parameters (hyper or otherwise)
- DataFrame
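Parameters are attached to individual stages. As a sketch, a random forest stage can have hyperparameters set directly with setter methods or supplied through a ParamMap at training time; the parameter names are real MLlib parameters, but the values chosen here are illustrative, not tuned settings:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.param.ParamMap

// Hyperparameters can be set directly on a stage with setter methods ...
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(20)   // hyperparameter: number of trees in the forest
  .setMaxDepth(5)    // hyperparameter: maximum depth of each tree

// ... or passed as a ParamMap at fit() time, overriding the setters above.
val overrides = ParamMap(rf.numTrees -> 50)
// val rfModel = rf.fit(labeled, overrides)      // illustrative values only
```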
This is where Spark comes in. Its MLlib library provides a set of pipeline APIs that give developers access to multiple algorithms and make it easy to combine them into a single pipeline of ordered stages, much like a sequence of choreographed movements in a ballet. In this chapter, we will use the random forest classifier.
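Putting the pieces together, the sketch below chains the stages from the previous snippets into a single Pipeline. The stage ordering shown (index the label, assemble the features, then train the random forest) is one reasonable arrangement assumed for illustration:

```scala
import org.apache.spark.ml.Pipeline

// A Pipeline is itself an Estimator: fitting it runs the ordered stages in
// sequence and returns a PipelineModel, which is a Transformer.
val pipeline = new Pipeline()
  .setStages(Array(indexer, assembler, rf))

val pipelineModel = pipeline.fit(irisDF)           // runs every stage in order
val predictions = pipelineModel.transform(irisDF)  // end-to-end transformation
predictions.select("Species", "prediction").show()
```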
We have now covered the essential pipeline concepts. These practicalities will help us move into the next section, where we list our implementation objectives.