The Iris dataset underpins perhaps the simplest, yet most famous, data analysis task in the ML space. In this article, you will build a solution for a data analysis and classification task on the Iris dataset using Scala.
This article is an excerpt taken from Modern Scala Projects written by Ilango Gurusamy.
The following diagrams together help in understanding the different components of this project. The pipeline involves training (fitting), transformation, and validation operations. More than one model is trained, and the best model (or mapping function) is selected to give us an accurate approximation for predicting the species of an Iris flower, based on measurements of that flower:
Project block diagram
A breakdown of the project block diagram is as follows:
The following diagram gives a more detailed description of the different phases in terms of the functions performed in each phase. Later, we will visualize the pipeline in terms of its constituent stages.
For now, the diagram depicts four stages, starting with a data pre-processing phase, which is deliberately kept separate from the numbered phases. Think of the pipeline as a two-step process:
Pipeline diagram
Referring to the preceding diagram, the first implementation objective is to set up Spark inside an SBT project. An SBT project is a self-contained application that we can run on the command line to predict Iris labels. In the SBT project, dependencies are specified in a build.sbt file, and our application code will create its own SparkSession and SparkContext.
That brings us to a list of implementation objectives, which are as follows:
Head over to the UCI Machine Learning Repository website at https://archive.ics.uci.edu/ml/datasets/iris and click on Download: Data Folder. Extract this folder someplace convenient and copy over iris.csv into the root of your project folder.
You may refer back to the project overview for an in-depth description of the Iris dataset. We depict the contents of the iris.csv file here, as follows:
A snapshot of the Iris dataset with 150 rows
You may recall that the iris.csv file is a 150-row file, with comma-separated values.
Now that we have the dataset, the first step is to perform EDA on it. The Iris dataset is multivariate, meaning there is more than one (independent) variable, so we will carry out a basic multivariate EDA on it. But we need a DataFrame to do that. How we create that DataFrame, as a prelude to EDA, is the goal of the next section.
Before we get down to building the SBT pipeline project, we will conduct a preliminary EDA in spark-shell. The plan is to derive a dataframe out of the dataset and then calculate basic statistics on it.
We have three tasks at hand for spark-shell:
We will then port that code over to a Scala file inside our SBT project.
That said, let's get down to loading the iris.csv file (inputting the data source) before eventually building a DataFrame.
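Before moving to Spark, it helps to be concrete about what "basic statistics" means here. The following plain-Scala sketch (no Spark required; the sample values are illustrative, not taken from the real file) computes count, mean, min, and max for one feature column, which is essentially what a DataFrame's describe() reports per column:

```scala
// Illustrative sepal-length values; the real iris.csv has 150 rows
val sepalLength: Seq[Double] = Seq(5.1, 4.9, 4.7, 4.6, 5.0)

val count = sepalLength.size
val mean  = sepalLength.sum / count  // arithmetic mean of the column
val min   = sepalLength.min
val max   = sepalLength.max

println(f"count=$count mean=$mean%.2f min=$min max=$max")
```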
Lay out your SBT project in a folder of your choice and name it IrisPipeline or any name that makes sense to you. This will hold all of our files needed to implement and run the pipeline on the Iris dataset.
The structure of our SBT project looks like the following:
Project structure
We will list dependencies in the build.sbt file. This is going to be an SBT project. Hence, we will bring in the following key libraries:
The following screenshot illustrates the build.sbt file:
The build.sbt file with Spark dependencies
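For reference, a minimal build.sbt along these lines might look like the following sketch; the Scala and Spark version numbers shown are illustrative, so match them to the versions used in the book's download bundle:

```sbt
name := "IrisPipeline"
version := "0.1"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.3.1",
  "org.apache.spark" %% "spark-sql"   % "2.3.1",
  "org.apache.spark" %% "spark-mllib" % "2.3.1"
)
```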
The build.sbt file referenced in the preceding snapshot is readily available for you in the book's download bundle. Drill down to the folder Chapter01 code under ModernScalaProjects_Code and copy the folder over to a convenient location on your computer.
Drop the iris.csv file that you downloaded in Step 1 – getting the Iris dataset from the UCI Machine Learning Repository into the root folder of our new SBT project. Refer to the earlier screenshot that depicts the updated project structure with the iris.csv file inside of it.
Step 4 is broken down into the following steps:
What follows is how the code is laid out in the iris.scala file.
In iris.scala, after the package statement, place the following import statements:
import org.apache.spark.sql.SparkSession
Create SparkSession inside a trait, which we shall call IrisWrapper:
lazy val session: SparkSession = SparkSession.builder().getOrCreate()
Just one SparkSession is made available to all classes extending from IrisWrapper. Create a val to hold the iris.csv file path:
val dataSetPath = "<<path to folder containing your iris.csv file>>\\iris.csv"
Create a method to build the DataFrame. This method takes the complete path to the Iris dataset as a String and returns a DataFrame:
def buildDataFrame(dataSet: String): DataFrame = {
  // The following is an example of a dataSet parameter string: "C:\\Your\\Path\\To\\iris.csv"
Import the DataFrame class by updating the previous import statement for SparkSession:
import org.apache.spark.sql.{DataFrame, SparkSession}
Create a nested function inside buildDataFrame to process the raw dataset. Name this function getRows; it takes no parameters and returns an Array[(Vector, String)]. The textFile method on the SparkContext processes iris.csv into an RDD[String]:
val result1: RDD[String] = session.sparkContext.textFile(<<path to iris.csv represented by the dataSetPath variable>>)
The resulting RDD contains two partitions. Each partition, in turn, contains rows of strings separated by a newline character, '\n'. Each row in the RDD represents its original counterpart in the raw data.
In the next step, we will attempt several data transformation steps. We start by applying a flatMap operation over the RDD, culminating in the DataFrame creation. A DataFrame is a view over a Dataset, which happens to be the fundamental data abstraction unit in the Spark 2.0 line.
We get started by invoking flatMap, passing a function block to it, followed by the successive transformations listed below, eventually resulting in an Array[(org.apache.spark.ml.linalg.Vector, String)]. Each Vector represents a row of feature measurements.
The Scala code to give us Array[(org.apache.spark.ml.linalg.Vector, String)] is as follows:
// Each line in the RDD is a row in the Dataset, represented by a String,
// which we can split along the newline character '\n'
val result2: RDD[String] = result1.flatMap { partition => partition.split("\n").toList }
// The second transformation operation splits each line of the dataset
// along the comma separating each element of that line
val result3: RDD[Array[String]] = result2.map(_.split(","))
Next, drop the header row, but not before performing a collect that returns an Array[Array[String]]:
val result4: Array[Array[String]] = result3.collect.drop(1)
The header row is gone; now import the Vectors class:
import org.apache.spark.ml.linalg.Vectors
Now, transform Array[Array[String]] into Array[(Vector, String)]:
val result5 = result4.map(row => (Vectors.dense(row(1).toDouble, row(2).toDouble,
  row(3).toDouble, row(4).toDouble), row(5)))
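The chain of transformations above (split lines, split on commas, drop the header, parse the feature values) can be exercised without Spark. The following plain-Scala sketch mirrors the same steps on a tiny in-memory sample; the header and rows are illustrative of a CSV that carries an Id column, which is what the row(1) to row(5) indices above assume:

```scala
// A miniature stand-in for the raw file contents: a header row plus data rows
val raw = Seq(
  "Id,SepalLength,SepalWidth,PetalLength,PetalWidth,Species",
  "1,5.1,3.5,1.4,0.2,Iris-setosa",
  "2,7.0,3.2,4.7,1.4,Iris-versicolor"
)

// Split each line on commas, then drop the header row
val rows: Seq[Array[String]] = raw.map(_.split(",")).drop(1)

// Columns 1..4 are the feature measurements; column 5 is the species label
val parsed: Seq[(Array[Double], String)] =
  rows.map(r => (Array(r(1).toDouble, r(2).toDouble, r(3).toDouble, r(4).toDouble), r(5)))
```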
Now, let's split our dataset in two by providing a random seed:
val splitDataSet: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] =
  dataSet.randomSplit(Array(0.85, 0.15), 98765L)
Now our new splitDataSet contains two datasets:
Confirm that the new dataset is of size 2:
scala> splitDataSet.size
res48: Int = 2
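Note that randomSplit's weights are proportions, not exact counts: each row is assigned randomly, so an 85/15 split is only approximate. A plain-Scala analogue of a seeded random split (the seed value reuses the one above; the mechanics are illustrative, not Spark's exact algorithm):

```scala
val rnd  = new scala.util.Random(98765L)
val data = (1 to 150).toList

// Each row lands on the training side with probability 0.85
val (trainRows, testRows) = data.partition(_ => rnd.nextDouble() < 0.85)

println(s"train=${trainRows.size} test=${testRows.size}")
```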
Assign the training dataset to a variable, trainSet:
val trainSet = splitDataSet(0)
trainSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iris-features-column: vector, iris-species-label-column: string]
Assign the testing dataset to a variable, testSet:
val testSet = splitDataSet(1)
testSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iris-features-column: vector, iris-species-label-column: string]
Count the number of rows in the training dataset:
trainSet.count
res12: Long = 136
Count the number of rows in the testing dataset:
testSet.count
res9: Long = 14
There are 150 rows in all.
Referring back to Step 5, DataFrame creation: this DataFrame contains column names that correspond to the columns present in the DataFrame produced in that step.
The first step in creating a classifier is to pass (hyper)parameters into it. A fairly comprehensive list of parameters looks like this:
Look at the IrisPipeline.scala file for values of each of these parameters.
But this time, we will employ an exhaustive grid-search-based model selection process, built on combinations of parameters whose value ranges are specified.
Create a randomForestClassifier instance. Set the features and featureSubsetStrategy:
val randomForestClassifier = new RandomForestClassifier()
  .setFeaturesCol(irisFeatures_CategoryOrSpecies_IndexedLabel._1)
  .setFeatureSubsetStrategy("sqrt")
Start building the Pipeline, which has two stages, the indexer and the classifier:
val irisPipeline = new Pipeline().setStages(Array[PipelineStage](indexer, randomForestClassifier))
Next, set the hyperparameters on the classifier: numTrees (the number of trees, set to 15), maxDepth, and impurity, the last with two possible values, gini and entropy.
Build a parameter grid with all three hyperparameters:
val finalParamGrid: Array[ParamMap] = gridBuilder3.build()
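What ParamGridBuilder's build() returns is the cross product of the value lists added to it. The plain-Scala sketch below enumerates the same kind of grid for the three hyperparameters discussed above (the specific maxDepth candidates are illustrative; only numTrees = 15 and the two impurity values come from the text):

```scala
val numTrees = Seq(15)
val maxDepth = Seq(3, 5, 8)           // illustrative candidate depths
val impurity = Seq("gini", "entropy")

// One entry per combination, just like the ParamMap array built by ParamGridBuilder
val grid = for {
  n <- numTrees
  d <- maxDepth
  i <- impurity
} yield (n, d, i)

println(s"${grid.size} candidate models")  // 1 * 3 * 2 = 6
```

Each tuple corresponds to one candidate model that the grid search will train and compare.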
Next, we want to split our training set into a validation set and a training set:
val validatedTestResults = new TrainValidationSplit()
On this variable, set the seed, the estimator parameter maps, the estimator (irisPipeline), and a train ratio of 0.8:
val validatedTestResults = new TrainValidationSplit()
  .setSeed(1234567L)
  .setEstimator(irisPipeline)
  .setEstimatorParamMaps(finalParamGrid)
  .setTrainRatio(0.8)
Finally, do a fit and a transform with our training dataset and testing dataset. Great! Now the classifier is trained. In the next step, we will apply this classifier to the testing data.
The purpose of our validation set is to be able to make a choice between models. We want an evaluation metric and hyperparameter tuning. We will now create an instance of a validation estimator called TrainValidationSplit, which will split the training set into a validation set and a training set:
validatedTestResults.setEvaluator(new MulticlassClassificationEvaluator())
Next, we fit this estimator over the training dataset to produce a model and a transformer that we will use to transform our testing dataset. Finally, we perform a validation for hyperparameter tuning by applying an evaluator for a metric.
The new validatedTestResults DataFrame should look something like this:
+--------------------+-------------------+-----+--------------+-------------+----------+
|iris-features-column|iris-species-column|label| rawPrediction|  probability|prediction|
+--------------------+-------------------+-----+--------------+-------------+----------+
|   [4.4,3.2,1.3,0.2]|        Iris-setosa|  0.0|[40.0,0.0,0.0]|[1.0,0.0,0.0]|       0.0|
|   [5.4,3.9,1.3,0.4]|        Iris-setosa|  0.0|[40.0,0.0,0.0]|[1.0,0.0,0.0]|       0.0|
|   [5.4,3.9,1.7,0.4]|        Iris-setosa|  0.0|[40.0,0.0,0.0]|[1.0,0.0,0.0]|       0.0|
+--------------------+-------------------+-----+--------------+-------------+----------+
Let's return a new dataset by passing in column expressions for prediction and label:
val validatedTestResultsDataset: DataFrame = validatedTestResults.select("prediction", "label")
In the preceding line of code, we produced a new DataFrame with two columns:
That brings us to the next step, an evaluation step. We want to know how well our model performed. That is the goal of the next step.
In this section, we will test the accuracy of the model. We want to know how well our model performed. Any ML process is incomplete without an evaluation of the classifier.
That said, we perform an evaluation as a two-step process:
val modelOutputAccuracy: Double = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("accuracy")
  .setPredictionCol("prediction")
  .evaluate(validatedTestResultsDataset)
We set the label column, a metric name, and the prediction column, then invoke evaluation with the validated results dataset.
Note the accuracy of the model output results on the testing dataset from the modelOutputAccuracy variable.
The other metric to evaluate is how close the predicted label values in the prediction column are to the actual label values in the (indexed) label column.
Next, we want to extract the metrics:
val multiClassMetrics = new MulticlassMetrics(validatedRDD2)
Our pipeline produced predictions. As with any prediction, we need to have a healthy degree of skepticism, so we naturally want a sense of how our engineered prediction process performed. The algorithm did all the heavy lifting for us in this regard; everything we did in this step was done for the purpose of evaluation. What is worth reiterating is what is being evaluated: we want to know how close the predicted values came to the actual label values. To obtain that knowledge, we use the MulticlassMetrics class to evaluate metrics that give us a measure of the model's performance via two methods:
val accuracyMetrics = (multiClassMetrics.accuracy, multiClassMetrics.weightedPrecision)
val accuracy = accuracyMetrics._1
val weightedPrecision = accuracyMetrics._2
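To make concrete what these two numbers measure, the sketch below computes accuracy and weighted precision from a small list of (prediction, label) pairs in plain Scala, following the same definitions MulticlassMetrics uses: accuracy is the fraction of correct predictions, and weighted precision is each class's precision weighted by that class's share of the true labels. The pair values here are illustrative:

```scala
// (prediction, label) pairs; the values are illustrative
val pairs: Seq[(Double, Double)] =
  Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0))

val total = pairs.size.toDouble

// Accuracy: fraction of pairs where the prediction equals the label
val accuracy = pairs.count { case (p, l) => p == l } / total

// Precision for one class c: true positives / all predictions of c (0 if never predicted)
def precision(c: Double): Double = {
  val predicted = pairs.count(_._1 == c)
  if (predicted == 0) 0.0
  else pairs.count { case (p, l) => p == c && l == c }.toDouble / predicted
}

// Weighted precision: each class's precision, weighted by its true-label frequency
val labels = pairs.map(_._2).distinct
val weightedPrecision =
  labels.map(c => precision(c) * pairs.count(_._2 == c) / total).sum
```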
These metrics represent evaluation results for our classifier or classification model. In the next step, we will run the application as a packaged SBT application.
At the root of your project folder, issue the sbt console command, and in the Scala shell, import the IrisPipeline object and then invoke the main method of IrisPipeline with the argument iris:
sbt console

scala> import com.packt.modern.chapter1.IrisPipeline
scala> IrisPipeline.main(Array("iris"))

Accuracy (precision) is 0.9285714285714286
Weighted Precision is: 0.9428571428571428
In the root folder of your SBT application, run:
sbt package
When SBT is done packaging, the application JAR can be deployed into a cluster using spark-submit; but since we are in standalone deploy mode, it will be deployed locally ([local]):
The application JAR file
The package command created a JAR file that is available under the target folder. In the next section, we will deploy the application into Spark.
At the root of the application folder, issue the spark-submit command with the class and JAR file path arguments, respectively.
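For example, the command might look like the following; the master setting and JAR path are illustrative, so substitute the JAR that sbt package actually produced under your target folder:

```shell
spark-submit \
  --class com.packt.modern.chapter1.IrisPipeline \
  --master local[*] \
  target/scala-2.11/irispipeline_2.11-0.1.jar iris
```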
If everything went well, the application does the following:
Thus, we implemented an ML workflow, or ML pipeline. The pipeline combined several stages of data analysis into one workflow. We started by loading the data; from there, we created training and test data, preprocessed the dataset, trained the RandomForestClassifier model, applied that classifier to the test data, evaluated the classifier, and computed metrics that demonstrated the importance of each feature in the classification.
If you've enjoyed reading this post visit the book, Modern Scala Projects to build efficient data science projects that fulfill your software requirements.