Implementing the Iris pipeline

In this section, we will set forth what our pipeline implementation objectives are. We will document tangible results as we step through individual implementation steps.

Before we implement the Iris pipeline, we want to understand what a pipeline is from a conceptual and practical perspective. Therefore, we define a pipeline as a DataFrame processing workflow with multiple pipeline stages operating in a certain sequence.

A DataFrame is a Spark abstraction that provides an API. This API lets us work with collections of objects. At a high-level it represents a distributed collection holding rows of data, much like a relational database table. Each member of a row (for example, a Sepal-Width measurement) in this DataFrame falls under a named column called Sepal-Width.

Each stage in a pipeline is an algorithm that is either a Transformer or an Estimator. As a DataFrame or DataFrame(s) flow through the pipeline, two types of stages (algorithms) exist:

Transformer stage: This involves a transformation action that transforms one DataFrame into another DataFrame
Estimator stage: This involves a training action on a DataFrame that produces another DataFrame.

In summary, a pipeline is a single unit, requiring stages, but inclusive of parameters and DataFrame(s). The entire pipeline structure is listed as follows:

Transformer
Estimator
Parameters (hyper or otherwise)
DataFrame

This is where Spark comes in. Its MLlib library provides a set of pipeline APIs allowing developers to access multiple algorithms and facilitates their combining into a single pipeline of ordered stages, much like a sequence of choreographed motions in a ballet. In this chapter, we will use the random forest classifier.

We covered essential pipeline concepts. These are practicalities that will help us move into the section, where we will list implementation objectives.

Iris pipeline implementation objectives

Before listing the implementation objectives, we will lay out an architecture for our pipeline. Shown here under are two diagrams representing an ML workflow, a pipeline.

The following diagrams together help in understanding the different components of this project. That said, this pipeline involves training (fitting), transformation, and validation operations. More than one model is trained and the best model (or mapping function) is selected to give us an accurate approximation predicting the species of an Iris flower (based on measurements of those flowers):

Project block diagram

A breakdown of the project block diagram is as follows:

Spark, which represents the Spark cluster and its ecosystem
Training dataset
Model
Dataset attributes or feature measurements
An inference process, that produces a prediction column

The following diagram represents a more detailed description of the different phases in terms of the functions performed in each phase. Later we will come to visualize pipeline in terms of its constituent stages.

For now, the diagram depicts four stages, starting with a data pre-processing phase, which is considered separate from the numbered phases deliberately. Think of the pipeline as a two-step process:

A data cleansing phase, or pre-processing phase. An important phase that could include a subphase of Exploratory Data Analysis (EDA) (not explicitly depicted in the latter diagram).
A data analysis phase that begins with Feature Extraction, followed by Model Fitting, and Model validation, all the way to deployment of an Uber pipeline JAR into Spark:

Pipeline diagram

Referring to the preceding diagram, the first implementation objective is to set up Spark inside an SBT project. An SBT project is a self-contained application, which we can run on the command line to predict Iris labels. In the SBT project, dependencies are specified in a build.sbt file and our application code will create its own SparkSession and SparkContext.

So that brings us to a listing of implementation objectives and these are as follows:

Get the Iris dataset from the UCI Machine Learning Repository
Conduct preliminary EDA in the Spark shell
Create a new Scala project in IntelliJ, and carry out all implementation steps, until the evaluation of the Random Forest classifier
Deploy the application to your local Spark cluster

Step 1 – getting the Iris dataset from the UCI Machine Learning Repository

Head over to the UCI Machine Learning Repository website at https://archive.ics.uci.edu/ml/datasets/iris and click on Download: Data Folder. Extract this folder someplace convenient and copy over iris.csv into the root of your project folder.

You may refer back to the project overview for an in-depth description of the Iris dataset. We depict the contents of the iris.csv file here, as follows:

A snapshot of the Iris dataset with 150 sets

You may recall that the iris.csv file is a 150-row file, with comma-separated values.

Now that we have the dataset, the first step will be performing EDA on it. The Iris dataset is multivariate, meaning there is more than one (independent) variable, so we will carry out a basic multivariate EDA on it. But we need DataFrame to let us do that. How we create a dataframe as a prelude to EDA is the goal of the next section.

Step 2 – preliminary EDA

Before we get down to building the SBT pipeline project, we will conduct a preliminary EDA in spark-shell. The plan is to derive a dataframe out of the dataset and then calculate basic statistics on it.

We have three tasks at hand for spark-shell:

Fire up spark-shell
Load the iris.csv file and build DataFrame
Calculate the statistics

We will then port that code over to a Scala file inside our SBT project.

That said, let's get down to loading the iris.csv file (inputting the data source) before eventually building DataFrame.

Firing up Spark shell

Fire up the Spark Shell by issuing the following command on the command line.

spark-shell --master local[2]

In the next step, we start with the available Spark session 'spark'. 'spark' will be our entry point to programming with Spark. It also holds properties required to connect to our Spark (local) cluster. With this information, our next goal is to load the iris.csv file and produce a DataFrame

Loading the iris.csv file and building a DataFrame

The first step to loading the iris csv file is to invoke the read method on spark. The read method returns DataFrameReader, which can be used to read our dataset:

val dfReader1 = spark.read
dfReader1: org.apache.spark.sql.DataFrameReader=org.apache.spark.sql.DataFrameReader@6980d3b3

dfReader1 is of type org.apache.spark.sql.DataFrameReader. Calling the format method on dfReader1 with Spark's com.databricks.spark.csv CSV format-specifier string returns DataFrameReader again:

val dfReader2 = dfReader1.format("com.databricks.spark.csv")
dfReader2: org.apache.spark.sql.DataFrameReader=org.apache.spark.sql.DataFrameReader@6980d3b3

After all, iris.csv is a CSV file.

Needless to say, dfReader1 and dfReader2 are the same DataFrameReader instance.

At this point, DataFrameReader needs an input data source option in the form of a key-value pair. Invoke the option method with two arguments, a key "header" of type string and its value true of type Boolean:

val dfReader3 = dfReader2.option("header", true)

In the next step, we invoke the option method again with an argument inferSchema and a true value:

val dfReader4 = dfReader3.option("inferSchema", true)

What is inferSchema doing here? We are simply telling Spark to guess the schema of our input data source for us.

Up until now, we have been preparing DataFrameReader to load iris.csv. External data sources require a path for Spark to load the data for DataFrameReader to process and spit out DataFrame.

The time is now right to invoke the load method on DataFrameReader dfReader4. Pass into the load method the path to the Iris dataset file. In this case, the file is right under the root of the project folder:

val dFrame1 = dfReader4.load("iris.csv")
dFrame1: org.apache.spark.sql.DataFrame = [Id: int, SepalLengthCm: double ... 4 more fields]

That's it. We now have DataFrame!

Calculating statistics

Invoking the describe method on this DataFrame should cause Spark to perform a basic statistical analysis on each column of DataFrame:

dFrame1.describe("Id","SepalLengthCm","SepalWidthCm","PetalLengthCm","PetalWidthCm","Species")
WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res16: org.apache.spark.sql.DataFrame = [summary: string, Id: string ... 5 more fields]

Lets fix the WARN.Utils issue described in the preceding code block. The fix is to locate the file spark-defaults-template.sh under SPARK_HOME/conf and save it as spark-defaults.sh.

At the bottom of this file, add an entry for spark.debug.maxToStringFields. The following screenshot illustrates this:

Fixing the WARN Utils problem in spark-defaults.sh

Save the file and restart spark-shell.

Now, inspect the updated Spark configuration again. We updated the value of spark.debug.maxToStringFields in the spark-defaults.sh file. This change is supposed to fix the truncation problem reported by Spark. We will confirm imminently that the change we made caused Spark to update its configuration also. That is easily done by inspecting SparkConf.

Inspecting your SparkConf again

As before, invoking the getConf returns the SparkContext instance that stores configuration values. Invoking getAll on that instance returns an Array of configuration values. One of those values is an updated value of spark.debug.maxToStringFields:

sc.getConf.getAll
res4: Array[(String, String)] = Array((spark.repl.class.outputDir,C:\Users\Ilango\AppData\Local\Temp\spark-10e24781-9aa8-495c-a8cc-afe121f8252a\repl-c8ccc3f3-62ee-46c7-a1f8-d458019fa05f), (spark.app.name,Spark shell), (spark.sql.catalogImplementation,hive), (spark.driver.port,58009), (spark.debug.maxToStringFields,150),

That updated value for spark.debug.maxToStringFields is now 150.

spark.debug.maxToStringFields had a default value of 25 inside a private object called Utils.

Calculating statistics again

Run the invoke on the dataframe describe method and pass to it column names:

val dFrame2 =  dFrame1.describe("Id","SepalLengthCm","SepalWidthCm","PetalLengthCm","PetalWidthCm","Species"
)
dFrame2: org.apache.spark.sql.DataFrame = [summary: string, Id: string ... 5 more fields]

The invoke on the describe method of DataFrame dfReader results in a transformed DataFrame that we call dFrame2. On dFrame2, we invoke the show method to return a table of statistical results. This completes the first phase of a basic yet important EDA:

val dFrame2Display= = dfReader2.show

The results of the statistical analysis are shown in the following screenshot:

Results of statistical analysis

We did all that extra work simply to demonstrate the individual data reading, loading, and transformation stages. Next, we will wrap all of our previous work in one line of code:

val dfReader = spark.read.format("com.databricks.spark.csv").option("header",true).option("inferSchema",true).load("iris.csv")
dfReader: org.apache.spark.sql.DataFrame = [Id: int, SepalLengthCm: double ... 4 more fields]

That completes the EDA on spark-shell. In the next section, we undertake steps to implement, build (using SBT), deploy (using spark-submit), and execute our Spark pipeline application. We start by creating a skeletal SBT project.

Step 3 – creating an SBT project

Lay out your SBT project in a folder of your choice and name it IrisPipeline or any name that makes sense to you. This will hold all of our files needed to implement and run the pipeline on the Iris dataset.

The structure of our SBT project looks like the following:

Project structure

We will list dependencies in the build.sbt file. This is going to be an SBT project. Hence, we will bring in the following key libraries:

Spark Core
Spark MLlib
Spark SQL

The following screenshot illustrates the build.sbt file:

The build.sbt file with Spark dependencies

The build.sbt file referenced in the preceding snapshot is readily available for you in the book's download bundle. Drill down to the folder Chapter01 code under ModernScalaProjects_Code and copy the folder over to a convenient location on your computer.

Drop the iris.csv file that you downloaded in Step 1 – getting the Iris dataset from the UCI Machine Learning Repository into the root folder of our new SBT project. Refer to the earlier screenshot that depicts the updated project structure with the iris.csv file inside of it.

Step 4 – creating Scala files in SBT project

Step 4 is broken down into the following steps:

Create the Scala file iris.scala in the com.packt.modern.chapter1 package.
Up until now, we relied on SparkSession and SparkContext, which spark-shell gave us. This time around, we need to create SparkSession, which will, in turn, give us SparkContext.

What follows is how the code is laid out in the iris.scala file.

In iris.scala, after the package statement, place the following import statements:

import org.apache.spark.sql.SparkSession

Create SparkSession inside a trait, which we shall call IrisWrapper:

lazy val session: SparkSession = SparkSession.builder().getOrCreate()

Just one SparkSession is made available to all classes extending from IrisWrapper. Create val to hold the iris.csv file path:

val dataSetPath = "<<path to folder containing your iris.csv file>>\\iris.csv"

Create a method to build DataFrame. This method takes in the complete path to the Iris dataset path as String and returns DataFrame:

def buildDataFrame(dataSet: String): DataFrame = {
/*
 The following is an example of a dataSet parameter string: "C:\\Your\\Path\\To\\iris.csv"
*/

Import the DataFrame class by updating the previous import statement for SparkSession:

import org.apache.spark.sql.{DataFrame, SparkSession}

Create a nested function inside buildDataFrame to process the raw dataset. Name this function getRows. getRows which takes no parameters but returns Array[(Vector, String)]. The textFile method on the SparkContext variable processes the iris.csv into RDD[String]:

val result1: Array[String] = session.sparkContext.textFile(<<path to iris.csv represented by the dataSetPath variable>>)

The resulting RDD contains two partitions. Each partition, in turn, contains rows of strings separated by a newline character, '\n'. Each row in the RDD represents its original counterpart in the raw data.

In the next step, we will attempt several data transformation steps. We start by applying a flatMap operation over the RDD, culminating in the DataFrame creation. DataFrame is a view over Dataset, which happens to the fundamental data abstraction unit in the Spark 2.0 line.

Step 5 – preprocessing, data transformation, and DataFrame creation

We will get started by invoking flatMap, by passing a function block to it, and successive transformations listed as follows, eventually resulting in Array[(org.apache.spark.ml.linalg.Vector, String)]. A vector represents a row of feature measurements.

The Scala code to give us Array[(org.apache.spark.ml.linalg.Vector, String)] is as follows:

//Each line in the RDD is a row in the Dataset represented by a String, which we can 'split' along the new //line character
val result2: RDD[String] = result1.flatMap { partition => partition.split("\n").toList }

//the second transformation operation involves a split inside of each line in the dataset where there is a //comma separating each element of that line
val result3: RDD[Array[String]] = result2.map(_.split(","))

Next, drop the header column, but not before doing a collection that returns an Array[Array[String]]:

val result4: Array[Array[String]] = result3.collect.drop(1)

The header column is gone; now import the Vectors class:

import org.apache.spark.ml.linalg.Vectors

Now, transform Array[Array[String]] into Array[(Vector, String)]:

val result5 = result4.map(row => (Vectors.dense(row(1).toDouble, row(2).toDouble, row(3).toDouble, row(4).toDouble),row(5)))

The last step remaining is to create a final DataFrame

DataFrame Creation

Now, we invoke the createDataFrame method with a parameter, getRows. This returns DataFrame with featureVector and speciesLabel (for example, Iris-setosa):

val dataFrame = spark.createDataFrame(result5).toDF(featureVector, speciesLabel)

Display the top 20 rows in the new dataframe:

dataFrame.show
+--------------------+-------------------------+
|iris-features-column|iris-species-label-column|
+--------------------+-------------------------+
| [5.1,3.5,1.4,0.2]| Iris-setosa|
| [4.9,3.0,1.4,0.2]| Iris-setosa|
| [4.7,3.2,1.3,0.2]| Iris-setosa|
.....................
.....................
+--------------------+-------------------------+
only showing top 20 rows

We need to index the species label column by converting the strings Iris-setosa, Iris-virginica, and Iris-versicolor into doubles. We will use a StringIndexer to do that.

Now create a file called IrisPipeline.scala.

Create an object IrisPipeline that extends our IrisWrapper trait:

object IrisPipeline extends IrisWrapper {

Import the StringIndexer algorithm class:

import org.apache.spark.ml.feature.StringIndexer

Now create a StringIndexer algorithm instance. The StringIndexer will map our species label column to an indexed learned column:

val indexer = new StringIndexer().setInputCol
(irisFeatures_CategoryOrSpecies_IndexedLabel._2).setOutputCol(irisFeatures_CategoryOrSpecies_IndexedLabel._3)

Step 6 – creating, training, and testing data

Now, let's split our dataset in two by providing a random seed:

val splitDataSet: Array[org.apache.spark.sql.Dataset
[org.apache.spark.sql.Row]] = dataSet.randomSplit(Array(0.85, 0.15), 98765L)

Now our new splitDataset contains two datasets:

Train dataset: A dataset containing Array[(Vector, iris-species-label-column: String)]
Test dataset: A dataset containing Array[(Vector, iris-species-label-column: String)]

Confirm that the new dataset is of size 2:

splitDataset.size
res48: Int = 2

Assign the training dataset to a variable, trainSet:

val trainDataSet = splitDataSet(0)
trainSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iris-features-column: vector, iris-species-label-column: string]

Assign the testing dataset to a variable, testSet:

val testDataSet = splitDataSet(1)
testSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iris-features-column: vector, iris-species-label-column: string]

Count the number of rows in the training dataset:

trainSet.count
res12: Long = 14

Count the number of rows in the testing dataset:

testSet.count
res9: Long = 136

There are 150 rows in all.

Step 7 – creating a Random Forest classifier

In reference to Step 5 - DataFrame Creation. This DataFrame 'dataFrame' contains column names that corresponds to the columns present in the DataFrame produced in that step

The first step to create a classifier is to pass into it (hyper) parameters. A fairly comprehensive list of parameters look like this:

From 'dataFrame' we need the Features column name - iris-features-column
From 'dataFrame' we also need the Indexed label column name - iris-species-label-column
The sqrt setting for featureSubsetStrategy
Number of features to be considered per split (we have 150 observations and four features that will make our max_features value 2)
Impurity settings—values can be gini and entropy
Number of trees to train (since the number of trees is greater than one, we set a tree maximum depth), which is a number equal to the number of nodes
The required minimum number of feature measurements (sampled observations), also known as the minimum instances per node

Look at the IrisPipeline.scala file for values of each of these parameters.

But this time, we will employ an exhaustive grid search-based model selection process based on combinations of parameters, where parameter value ranges are specified.

Create a randomForestClassifier instance. Set the features and featureSubsetStrategy:

val randomForestClassifier = new RandomForestClassifier()
  .setFeaturesCol(irisFeatures_CategoryOrSpecies_IndexedLabel._1)
  .setFeatureSubsetStrategy("sqrt")

Start building Pipeline, which has two stages, Indexer and Classifier:

val irisPipeline = new Pipeline().setStages(Array[PipelineStage](indexer) ++  Array[PipelineStage](randomForestClassifier))

Next, set the hyperparameter num_trees (number of trees) on the classifier to 15, a Max_Depth parameter, and an impurity with two possible values of gini and entropy.

Build a parameter grid with all three hyperparameters:

val finalParamGrid: Array[ParamMap] = gridBuilder3.build()

Step 8 – training the Random Forest classifier

Next, we want to split our training set into a validation set and a training set:

val validatedTestResults: DataFrame = new TrainValidationSplit()

On this variable, set Seed, set EstimatorParamMaps, set Estimator with irisPipeline, and set a training ratio to 0.8:

val validatedTestResults: DataFrame = new TrainValidationSplit().setSeed(1234567L).setEstimator(irisPipeline)

Finally, do a fit and a transform with our training dataset and testing dataset. Great! Now the classifier is trained. In the next step, we will apply this classifier to testing the data.

Step 9 – applying the Random Forest classifier to test data

The purpose of our validation set is to be able to make a choice between models. We want an evaluation metric and hyperparameter tuning. We will now create an instance of a validation estimator called TrainValidationSplit, which will split the training set into a validation set and a training set:

val validatedTestResults.setEvaluator(new MulticlassClassificationEvaluator())

Next, we fit this estimator over the training dataset to produce a model and a transformer that we will use to transform our testing dataset. Finally, we perform a validation for hyperparameter tuning by applying an evaluator for a metric.

The new ValidatedTestResults DataFrame should look something like this:

--------+
 |iris-features-column|iris-species-column|label| rawPrediction| probability|prediction|
 +--------------------+-------------------+-----+--------------------+
 | [4.4,3.2,1.3,0.2]| Iris-setosa| 0.0| [40.0,0.0,0.0]| [1.0,0.0,0.0]| 0.0|
 | [5.4,3.9,1.3,0.4]| Iris-setosa| 0.0| [40.0,0.0,0.0]| [1.0,0.0,0.0]| 0.0|
 | [5.4,3.9,1.7,0.4]| Iris-setosa| 0.0| [40.0,0.0,0.0]| [1.0,0.0,0.0]| 0.0|

Let's return a new dataset by passing in column expressions for prediction and label:

val validatedTestResultsDataset:DataFrame = validatedTestResults.select("prediction", "label")

In the line of code, we produced a new DataFrame with two columns:

An input label
A predicted label, which is compared with its corresponding value in the input label column

That brings us to the next step, an evaluation step. We want to know how well our model performed. That is the goal of the next step.

Step 10 – evaluate Random Forest classifier

In this section, we will test the accuracy of the model. We want to know how well our model performed. Any ML process is incomplete without an evaluation of the classifier.

That said, we perform an evaluation as a two-step process:

Evaluate the model output
Pass in three hyperparameters:

val modelOutputAccuracy: Double = new MulticlassClassificationEvaluator()

Set the label column, a metric name, the prediction column label, and invoke evaluation with the validatedTestResults dataset.

Note the accuracy of the model output results on the testing dataset from the modelOutputAccuracy variable.

The other metrics to evaluate are how close the predicted label value in the 'predicted' column is to the actual label value in the (indexed) label column.

Next, we want to extract the metrics:

val multiClassMetrics = new MulticlassMetrics(validatedRDD2)

Our pipeline produced predictions. As with any prediction, we need to have a healthy degree of skepticism. Naturally, we want a sense of how our engineered prediction process performed. The algorithm did all the heavy lifting for us in this regard. That said, everything we did in this step was done for the purpose of evaluation. Who is being evaluated here or what evaluation is worth reiterating? That said, we wanted to know how close the predicted values were compared to the actual label value. To obtain that knowledge, we decided to use the MulticlassMetrics class to evaluate metrics that will give us a measure of the performance of the model via two methods:

Accuracy
Weighted precision

The following lines of code will give us value of Accuracy and Weighted Precision. First we will create an accuracyMetrics tuple, which should contain the values of both accuracy and weighted precision

val accuracyMetrics = (multiClassMetrics.accuracy, multiClassMetrics.weightedPrecision)

Obtain the value of accuracy.

val accuracy = accuracyMetrics._1

Next, obtain the value of weighted precision.


val weightedPrecsion = accuracyMetrics._2

These metrics represent evaluation results for our classifier or classification model. In the next step, we will run the application as a packaged SBT application.

Step 11 – running the pipeline as an SBT application

At the root of your project folder, issue the sbt console command, and in the Scala shell, import the IrisPipeline object and then invoke the main method of IrisPipeline with the argument iris:

sbt console
scala>
import com.packt.modern.chapter1.IrisPipeline
IrisPipeline.main(Array("iris")
Accuracy (precision) is 0.9285714285714286 Weighted Precision is: 0.9428571428571428

In the next section, we will show you how to package the application so that it is ready to be deployed into Spark as an Uber JAR.

Step 12 – packaging the application

In the root folder of your SBT application, run:

sbt package

When SBT is done packaging, the Uber JAR can be deployed into our cluster, using spark-submit, but since we are in standalone deploy mode, it will be deployed into [local]:

The application JAR file

The package command created a JAR file that is available under the target folder. In the next section, we will deploy the application into Spark.

Step 13 – submitting the pipeline application to Spark local

At the root of the application folder, issue the spark-submit command with the class and JAR file path arguments, respectively.

If everything went well, the application does the following:

Loads up the data.
Performs EDA.
Creates training, testing, and validation datasets.
Creates a Random Forest classifier model.
Trains the model.
Tests the accuracy of the model. This is the most important part—the ML classification task.

To accomplish this, we apply our trained Random Forest classifier model to the test dataset. This dataset consists of Iris flower data of so far not seen by the model. Unseen data is nothing but Iris flowers picked in the wild.
Applying the model to the test dataset results in a prediction about the species of an unseen (new) flower.
The last part is where the pipeline runs an evaluation process, which essentially is about checking if the model reports the correct species.
Lastly, pipeline reports back on how important a certain feature of the Iris flower turned out to be. As a matter of fact, the petal width turns out to be more important than the sepal width in carrying out the classification task.

That brings us to the last section of this chapter. We will summarize what we have learned. Not only that, we will give readers a glimpse into what they will learn in the next chapter.