The Iris dataset
The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification.
The setting is that of Iris flowers, of which there are multiple species that can be identified by their morphology. Today, the species would be defined by their genomic signatures, but in the 1930s, DNA had not even been identified as the carrier of genetic information.
The following four attributes of each plant were measured:
Sepal length
Sepal width
Petal length
Petal width
In general, we will refer to the individual numeric measurements in our data as features.
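As a minimal sketch of what this looks like in code, each plant can be stored as a vector of its four features. The class name `IrisMeasurement` and its field names are hypothetical, chosen only for illustration, and the values stand in for a single plant rather than the dataset itself.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class IrisMeasurement:
    """Hypothetical container for the four features of one plant (in cm)."""
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


# Illustrative values for a single plant.
plant = IrisMeasurement(5.1, 3.5, 1.4, 0.2)

# The feature vector is just the measurements in a fixed order.
features = astuple(plant)
print(features)
```

Keeping the order of the features fixed matters, since any rule we design later will refer to them by position.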
Additionally, for each plant, the species was recorded. The question now is: if we saw a new flower out in the field, could we make a good prediction about its species from its measurements?
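To make the question concrete, here is a minimal sketch of one possible prediction rule: give a new flower the species of the most similar labeled plant (a nearest-neighbor rule). This rule is chosen purely for illustration and is not necessarily the approach developed later. The function name `predict_species` is hypothetical, and the handful of labeled rows is an illustrative sample, not the full dataset.

```python
import math

# A few labeled examples: (sepal length, sepal width, petal length,
# petal width) in cm, plus the recorded species. Illustrative rows only.
labeled_plants = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def predict_species(features, examples=labeled_plants):
    """Return the species of the labeled plant closest to `features`."""
    # Euclidean distance between the two four-dimensional feature vectors.
    nearest = min(examples, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]


# A hypothetical new flower found in the field.
new_flower = (5.0, 3.4, 1.5, 0.2)
print(predict_species(new_flower))
```

With only six labeled plants this rule is fragile; it is meant to show the shape of the problem (features in, species out), not a good classifier.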
Predicting the species from labeled measurements is the supervised learning, or classification, problem: given labeled examples, we design a rule that can later be applied to new, unlabeled examples. This is the same setting that is used for spam classification; given the...