The most famous dataset in data science history is Sir Ronald Aylmer Fisher's classic Iris flower dataset, also known as Anderson's dataset after Edgar Anderson, who collected the data. Fisher introduced it in 1936 as a study in multivariate (and, as we shall see, multiclass) classification. What, then, does multivariate mean?
A multivariate classification problem
Understanding multivariate
The term multivariate can bear two meanings:
- As an adjective, multivariate means having or involving multiple variables.
- As a noun, multivariate may represent a mathematical vector whose individual elements are variates. Each individual element in this vector is a measurable quantity or variable.
Both meanings share a common denominator: the variable. Conducting a multivariate analysis of an experimental unit involves at least one measurable quantity or variable. A classic example of such an analysis is the Iris dataset, which has several measured variables (and one outcome variable) per observation.
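To make the noun sense concrete, a single Iris observation can be written as a vector of four variates. The following minimal Scala sketch (using Spark ML's Vectors factory, which we will meet again later in this chapter; the measurement values are illustrative) shows the idea:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// One multivariate observation: a vector whose elements are variates,
// here (sepal length, sepal width, petal length, petal width) in cm.
// The values are illustrative, not taken from the real dataset.
val observation: Vector = Vectors.dense(5.1, 3.5, 1.4, 0.2)

println(observation.size) // 4 variates in this vector
```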
In this subsection, we understood multivariate in terms of variables. In the next subsection, we briefly touch upon different kinds of variables, one of them being categorical variables.
Different kinds of variables
In general, variables are of two types:
- Quantitative variable: A variable representing a measurement that is quantified by a numeric value. Some examples of quantitative variables are:
- A variable representing the age of a girl called Huan (Age_Huan). In September 2017, this variable held the value 24; one year later, it would hold her current age plus 1 (arithmetically).
- A variable representing the number of planets in the solar system (Planet_Number). Currently, pending the discovery of any new planets, this variable holds the number 8. If scientists found a new celestial body tomorrow that they thought qualified as a planet, the Planet_Number variable's value would be bumped up from 8 to 9.
- Categorical variable: A variable that cannot be assigned a numerical measure in the natural order of things. For example, the residency status of an individual in the United States, which could be one of the following values: citizen, permanent resident, or non-resident.
In the next subsection, we will describe categorical variables in some detail.
Categorical variables
We will draw upon the definition of a categorical variable from the previous subsection. Categorical variables distinguish themselves from quantitative variables in a fundamental way. As opposed to a quantitative variable, which represents a measure of something in numerical terms, a categorical variable represents a grouping or category name, and it can take only one of a finite number of possible categories. For example, the species of an Iris flower is a categorical variable, and its value is drawn from a finite set of categories: Iris-setosa, Iris-virginica, and Iris-versicolor.
It may be useful to draw on other examples of categorical variables, listed as follows:
- The blood group of an individual as in A+, A-, B+, B-, AB+, AB-, O+, or O-
- The county that an individual is a resident of given a finite list of counties in the state of Missouri
- The political affiliation of a United States citizen, which could take categorical values such as Democrat, Republican, or Green Party
- In global warming studies, the type of a forest, a categorical variable that could take one of three values: tropical, temperate, or taiga
The first item in the preceding list, the blood group of a person, is a categorical variable whose corresponding data (values) are categorized (classified) into eight groups (A, B, AB, or O with their positives or negatives). In a similar vein, the species of an Iris flower is a categorical variable whose data (values) are categorized (classified) into three species groups: Iris-setosa, Iris-versicolor, and Iris-virginica.
That said, a common data analysis task in ML is to index, or encode, the string representations of categorical values into a numeric form, for example, doubles. Such indexing is a prelude to making a prediction on the target, or label, which we shall talk more about shortly.
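As a preview, here is a minimal sketch of what such indexing might look like with Spark ML's StringIndexer; the DataFrame df and its species column are assumptions for illustration, and we will carry out this step for real in a later section:

```scala
import org.apache.spark.ml.feature.StringIndexer

// Encode the categorical species strings into a numeric label column.
// StringIndexer assigns the doubles 0.0, 1.0, 2.0, ordered by label frequency.
val indexer = new StringIndexer()
  .setInputCol("species")   // categorical values, e.g. "Iris-setosa"
  .setOutputCol("label")    // indexed numeric form (doubles)

// df is an assumed DataFrame holding the raw Iris rows.
val indexed = indexer.fit(df).transform(df)
```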
With respect to the Iris flower dataset, its species variable is subject to a classification (or categorization) task with the express purpose of being able to predict the species of an Iris flower. At this point, we want to examine the Iris dataset, its rows, row characteristics, and much more, which is the focus of the upcoming topic.
Fisher's Iris dataset
The Iris flower dataset comprises a total of 150 rows, where each row represents one flower. Each row is also known as an observation. This 150-observation Iris dataset is made up of three kinds of observations corresponding to three different Iris flower species. The following table is an illustration:
Referring to the preceding table, it is clear that three flower species are represented in the Iris dataset. Each flower species contributes 50 observations apiece. Each observation holds four measurements, where each measurement corresponds to one of the following flower features:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width
The features listed earlier are illustrated in the following table for clarity:
Okay, so three flower species are represented in the Iris dataset. Speaking of species, we will henceforth replace the term species with the term classes whenever we need to stick to ML terminology. That means #1-Iris-setosa from earlier refers to Class #1, #2-Iris-virginica to Class #2, and #3-Iris-versicolor to Class #3.
We just listed three different Iris flower species that are represented in the Iris dataset. What do they look like? What do their features look like? These questions are answered in the following screenshot:
That said, let's look at the Sepal and Petal portions of each class of Iris flower. The Sepal (the larger, outer part) and Petal (the smaller, inner part) dimensions are how each class of Iris flower bears a relationship to the other two classes of Iris flowers. In the next section, we will summarize our discussion and expand its scope to a multiclass, multidimensional classification task.
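Before we move on, here is a minimal sketch of how the Iris dataset might be loaded into a Spark DataFrame; the file path and the column names are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("IrisClassification")
  .master("local[*]")
  .getOrCreate()

// Load the 150-row Iris dataset: four numeric feature columns plus the species.
// "iris.csv" and the column names are placeholders; adjust them to your copy.
val iris = spark.read
  .option("inferSchema", "true")
  .csv("iris.csv")
  .toDF("sepal_length", "sepal_width", "petal_length", "petal_width", "species")

iris.show(5)
```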
The Iris dataset represents a multiclass, multidimensional classification task
In this section, we will restate the facts about the Iris dataset and describe it in the context of an ML classification task:
- The Iris dataset classification task is multiclass because a prediction of the class of a new incoming Iris flower from the wild can belong to any of three classes.
- Indeed, this chapter is all about attempting a species classification (inferring the target class of a new Iris flower) using sepal and petal dimensions as feature parameters.
- The Iris dataset classification is multidimensional because there are four features.
- There are 150 observations, where each observation is comprised of measurements on four features. These measurements are also known by the following terms:
- Input attributes or instances
- Predictor variables (X)
- Input variables (X)
- Classification of an Iris flower picked in the wild is carried out by a model (the computed mapping function) that is given four flower feature measurements.
- The outcome of the Iris flower classification task is a (computed) predicted value for the response, obtained from the predictors by a process of learning (or fitting) over a discrete number of targets or category labels (Y). The outcome or predicted value may mean the same as the following:
- Categorical response variable: In a later section, we shall see that an indexer algorithm will transform all categorical values to numbers
- Response or outcome variable (Y)
So far, we have claimed that the outcome (Y) of our multiclass classification task is dependent on inputs (X). Where will these inputs come from? This is answered in the next section.
The training dataset
An integral aspect of our data analysis or classification task that we have not hitherto mentioned is the training dataset. A training dataset is our classification task's source of input data (X). We take advantage of this dataset to obtain a prediction on each target class by deriving optimal parameters or boundary conditions. We have just redefined our classification process by adding the extra detail of the training dataset. For a classification task, then, we have X on one side and Y on the other, with an inferred mapping function in the middle. That brings us to the mapping or predictor function, which is the focus of the next section.
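As a sketch of what carving out a training dataset might look like in Spark, assuming the iris DataFrame loaded earlier, we could hold back a portion of the data for testing; the 80/20 split and the seed are illustrative choices:

```scala
// Randomly split the 150 observations into training and test datasets.
// The 80/20 ratio and the seed value are illustrative, not prescriptive.
val Array(trainingData, testData) = iris.randomSplit(Array(0.8, 0.2), seed = 1234L)
```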
The mapping function
We have so far talked about an input variable (X) and an output variable (Y). The goal of any classification task, therefore, is to discover patterns and find a mapping (predictor) function that will take feature measurements (X) and map them over to the output (Y). That function is mathematically formulated as:
Y = f(X)
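In code terms, the mapping function can be thought of as nothing more than a function from the four feature measurements to a class label. The following plain Scala sketch is purely conceptual; the hand-written rule inside it is hypothetical, standing in for a function that would really be learned from data:

```scala
// Conceptually, the learned model is just a function Y = f(X): it takes
// the four feature measurements (X) and returns a class label index (Y).
type MappingFunction = Array[Double] => Double

// A hypothetical, hand-written stand-in; a real f is learned from data.
val f: MappingFunction = features => if (features(2) < 2.5) 0.0 else 1.0

f(Array(5.1, 3.5, 1.4, 0.2)) // 0.0, i.e. Class #1
```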
This mapping is how supervised learning works. A supervised learning algorithm is said to learn or discover this function. This will be the goal of the next section.
An algorithm and its mapping function
This section starts with a schematic depicting the components of the mapping function and an algorithm that learns the mapping function. The algorithm is learning the mapping function, as shown in the following diagram:
The goal of our classification process is to let the algorithm derive the best possible approximation of a mapping function by a learning (or fitting) process. When we find an Iris flower out in the wild and want to classify it, we use its measurements as new input data that our algorithm's mapping function will accept in order to give us a predicted value (Y). In other words, given the feature measurements of an Iris flower (the new data), the mapping function produced by a supervised learning algorithm (in our case, a random forest) will classify the flower.
Two kinds of ML problems exist that supervised learning algorithms can solve. These are as follows:
- Classification tasks
- Regression tasks
In the following paragraphs, we will talk about the mapping function with an example, explain the role played by a supervised learning classification task in deducing the mapping function, and introduce the concept of a model.
Let's say we already knew that the (mapping) function f(x) for the Iris dataset classification task was exactly of the form x + 1; then there would be no need for us to find a new mapping function. But recall that a mapping function is one that maps the relationship between flower features, such as sepal length and sepal width, and the species the flower belongs to. Does x + 1 do that? No.
Therefore, there is no preexisting function such as x + 1 that clearly maps the relationship between flower features and the flower's species. What we need is a model that captures that relationship as closely as possible. Data and its classification seldom tend to be straightforward. A supervised learning classification task starts life with no knowledge of what the function f(x) is. A supervised learning classification process applies ML techniques and strategies in an iterative process of deduction to ultimately learn what f(x) is.
In our case, such an ML endeavor is a classification task, a task where the function or mapping function is referred to in statistical or ML terminology as a model.
In the next section, we will describe what supervised learning is and how it relates to the Iris dataset classification. Indeed, this apparently simple ML technique finds wide application in data analysis, especially in the business domain.
Supervised learning – how it relates to the Iris classification task
At the outset, the following is a list of salient aspects of supervised learning:
- The term supervised in supervised learning stems from the fact that the algorithm learns or infers the mapping function under the supervision of known, correct labels.
- A data analysis task, either classification or regression.
- It involves a process of learning or inferring a mapping function from a labeled training dataset.
- Our Iris training dataset has training examples or samples, where each example may be represented by an input feature vector consisting of four measurements.
- A supervised learning algorithm learns or infers or derives the best possible approximation of a mapping function by carrying out a data analysis on the training data. The mapping function is also known as a model in statistical or ML terminology.
- The algorithm provides our model with parameters that it learns from the training example set or training dataset in an iterative process, as follows:
- Each iteration produces predicted class labels for the input instances
- Each iteration of the learning process produces progressively better generalizations of what the output class label should be, and the learning process ends when the algorithm reaches a reasonably high degree of correctness in its predictions
- An ML classification process employing supervised learning gives the algorithm samples with correctly predetermined labels.
- The Iris dataset is a typical example of a supervised learning classification process. The term supervised arises from the fact that, at each step of an iterative learning process, the algorithm applies an appropriate correction to its previously generated model to produce its next, better model.
In the next and remaining sections, we will use the Random Forest classification algorithm to run data analysis transformation tasks. One such task worth noting here is the transformation of string labels into an indexed label column represented by doubles.
Random Forest classification algorithm
In a preceding section, we noted the crucial role played by the input or training dataset. In this section, we reiterate the importance of this dataset. From an ML algorithm standpoint, the training dataset is the one the Random Forest algorithm takes advantage of to train, or fit, the model by generating the parameters the model needs to come up with its next best predicted value. In this chapter, we will put the Random Forest algorithm to work on training (and testing) Iris datasets. Indeed, the next paragraph starts with a discussion of Random Forest algorithms, or simply Random Forests.
A Random Forest algorithm encompasses decision tree-based supervised learning methods. It can be viewed as a composite whole comprising a large number of decision trees. In ML terminology, a Random Forest is an ensemble resulting from a profusion of decision trees.
A decision tree, as the name implies, represents a progressive decision-making process, made up of a root node followed by successive subtrees. The decision tree algorithm works its way down the tree, stopping at every node, starting with the root node, to pose a do-you-belong-to-a-certain-category question. Depending on whether the answer is yes or no, a decision is made to travel down a certain branch until the next node is encountered, where the algorithm repeats its interrogation. At each node, the answer received determines the next branch to take. The final outcome is a prediction at a terminating leaf.
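To make this traversal concrete, the following self-contained Scala sketch models a decision tree as a recursive structure; the feature indexes, thresholds, and the example tree itself are hypothetical, for illustration only:

```scala
// A decision tree as a recursive structure: internal nodes ask a
// threshold question about one feature; leaves hold the predicted class.
sealed trait TreeNode
case class Internal(featureIndex: Int, threshold: Double,
                    left: TreeNode, right: TreeNode) extends TreeNode
case class Leaf(predictedClass: String) extends TreeNode

// Traverse from the root, choosing a branch at each node,
// until a terminating leaf yields the prediction.
def predict(node: TreeNode, features: Array[Double]): String = node match {
  case Leaf(cls)                      => cls
  case Internal(i, threshold, lo, hi) =>
    if (features(i) <= threshold) predict(lo, features)
    else predict(hi, features)
}

// A hypothetical tree: petal length (index 2) splits off one class first.
val root: TreeNode = Internal(2, 2.45,
  Leaf("Iris-setosa"),
  Internal(3, 1.75, Leaf("Iris-versicolor"), Leaf("Iris-virginica")))

predict(root, Array(5.1, 3.5, 1.4, 0.2)) // "Iris-setosa"
```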
Speaking of trees, branches, and nodes, the decision process over the dataset can be viewed as a tree made up of multiple subtrees. Each decision at a node, and the decision tree algorithm's choice of a certain branch, is the result of an optimal composite of feature variables. Using a Random Forest algorithm, multiple decision trees are created. Each decision tree in this ensemble is the outcome of a randomized selection and ordering of variables. That brings us to what Random Forests are: an ensemble of a multitude of decision trees.
It is to be noted that one decision tree by itself cannot work well for a small sample like the Iris dataset. This is where the Random Forest algorithm steps in. It brings together, or aggregates, the predictions from its forest of decision trees. The aggregated results from the individual decision trees form one ensemble, better known as a Random Forest.
We chose the Random Forest method to make our predictions for a good reason: the net prediction formed out of an ensemble of predictions is significantly more accurate than that of any single decision tree.
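Putting these pieces together, a minimal sketch of training and evaluating a Random Forest with Spark ML might look like the following; it assumes the trainingData and testData DataFrames from the earlier sketches, with the species column already indexed into a label column by StringIndexer, and the tree count is an illustrative choice:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the four feature columns into a single features vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("sepal_length", "sepal_width", "petal_length", "petal_width"))
  .setOutputCol("features")

// An ensemble of decision trees; numTrees = 10 is an illustrative choice.
val rf = new RandomForestClassifier()
  .setLabelCol("label")        // indexed species column from StringIndexer
  .setFeaturesCol("features")
  .setNumTrees(10)

// Fit the model on the training dataset.
val model = rf.fit(assembler.transform(trainingData))

// Aggregate the forest's predictions on the held-out test data.
val predictions = model.transform(assembler.transform(testData))

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

println(s"Test accuracy = ${evaluator.evaluate(predictions)}")
```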
In the next section, we will formulate our classification problem, and in the Getting started with Spark section that follows, implementation details for the project are given.