Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Modern Scala Projects

You're reading from   Modern Scala Projects Leverage the power of Scala for building data-driven and high performance projects

Arrow left icon
Product type Paperback
Published in Jul 2018
Publisher Packt
ISBN-13 9781788624114
Length 334 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Ilango gurusamy Ilango gurusamy
Author Profile Icon Ilango gurusamy
Ilango gurusamy
Arrow right icon
View More author details
Toc

Project overview – problem formulation

The intent of this project is to develop an ML workflow or more accurately a pipeline. The goal is to solve the classification problem on the most famous dataset in data science history. 

If we saw a flower out in the wild that we know belongs to one of three Iris species, we have a classification problem on our hands. If we made measurements (X) on the unknown flower, the task is to learn to recognize the species to which the flower (and its plant) belongs.

Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level. While the latter two variables may also be considered in a numerical manner by using exact values for age and highest grade completed, it is often more informative to categorize such variables into a relatively small number of groups.

Analysis of categorical data generally involves the use of data tables. A two-way table presents categorical data by counting the number of observations that fall into each group for two variables, one divided into rows and the other divided into columns. 

In a nutshell, the high-level formulation of the classification problem is given as follows:

High-level formulation of the Iris supervised learning classification problem
In the Iris dataset, each row contains categorical data (values) in the fifth column. Each such value is associated with a label (Y). 

The formulation consists of the following:

  • Observed features 
  • Category labels 

Observed features are also known as predictor variables. Such variables have predetermined measured values. These are the inputs X. On the other hand, category labels denote possible output values that predicted variables can take. 

The predictor variables are as follows:

  • sepal_length: It represents sepal length, in centimeters, used as input
  • sepal_width: It represents sepal width, in centimeters, used as input
  • petal_length: It represents petal length, in centimeters, used as input

  • petal_width: It represents petal width, in centimeters, used as input
  • setosa: It represents Iris-setosa, true or false, used as target
  • versicolour: It represents Iris-versicolour, true or false, used as target
  • virginica: It represents Iris-virginica, true or false, used as target

Four outcome variables were measured from each sample; the length and the width of the sepals and petals.

The total build time of the project should be no more than a day in order to get everything working. For those new to the data science area, understanding the background theory, setting up the software, and getting to build the pipeline could take an extra day or two.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime