Search icon CANCEL
Subscription
0
Cart icon
Cart
Close icon
You have no products in your basket yet
Save more on your purchases!
Savings automatically calculated. No voucher code required
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
scikit-learn Cookbook - Second Edition

You're reading from  scikit-learn Cookbook - Second Edition

Product type Book
Published in Nov 2017
Publisher Packt
ISBN-13 9781787286382
Pages 374 pages
Edition 2nd Edition
Languages
Author (1):
Trent Hauck Trent Hauck
Profile icon Trent Hauck
Toc

Table of Contents (13) Chapters close

Preface 1. High-Performance Machine Learning – NumPy 2. Pre-Model Workflow and Pre-Processing 3. Dimensionality Reduction 4. Linear Models with scikit-learn 5. Linear Models – Logistic Regression 6. Building Models with Distance Metrics 7. Cross-Validation and Post-Model Workflow 8. Support Vector Machines 9. Tree Algorithms and Ensembles 10. Text and Multiclass Classification with scikit-learn 11. Neural Networks 12. Create a Simple Estimator

Viewing the iris dataset

Now that we've loaded the dataset, let's examine what is in it. The iris dataset pertains to a supervised classification problem.

How to do it...

  1. To access the observation variables, type:
iris.data

This outputs a NumPy array:

array([[ 5.1,  3.5,  1.4,  0.2],
[ 4.9, 3. , 1.4, 0.2],
[ 4.7, 3.2, 1.3, 0.2],
#...rest of output suppressed because of length
  1. Let's examine the NumPy array:
iris.data.shape

This returns:

(150L, 4L)

This means that the data is 150 rows by 4 columns. Let's look at the first row:

iris.data[0]

array([ 5.1, 3.5, 1.4, 0.2])

The NumPy array for the first row has four numbers.

  1. To determine what they mean, type:
iris.feature_names
['sepal length (cm)',

'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']

The feature or column names name the data. They are strings, and in this case, they correspond to dimensions in different types of flowers. Putting it all together, we have 150 examples of flowers with four measurements per flower in centimeters. For example, the first flower has measurements of 5.1 cm for sepal length, 3.5 cm for sepal width, 1.4 cm for petal length, and 0.2 cm for petal width. Now, let's look at the output variable in a similar manner:

iris.target

This yields an array of outputs: 0, 1, and 2. There are only three outputs. Type this:

iris.target.shape

You get a shape of:

(150L,)

This refers to an array of length 150 (150 x 1). Let's look at what the numbers refer to:

iris.target_names

array(['setosa', 'versicolor', 'virginica'],
dtype='|S10')

The output of the iris.target_names variable gives the English names for the numbers in the iris.target variable. The number zero corresponds to the setosa flower, number one corresponds to the versicolor flower, and number two corresponds to the virginica flower. Look at the first row of iris.target:

iris.target[0]

This produces zero, and thus the first row of observations we examined before correspond to the setosa flower.

How it works...

In machine learning, we often deal with data tables and two-dimensional arrays corresponding to examples. In the iris set, we have 150 observations of flowers of three types. With new observations, we would like to predict which type of flower those observations correspond to. The observations in this case are measurements in centimeters. It is important to look at the data pertaining to real objects. Quoting my high school physics teacher, "Do not forget the units!"

The iris dataset is intended to be for a supervised machine learning task because it has a target array, which is the variable we desire to predict from the observation variables. Additionally, it is a classification problem, as there are three numbers we can predict from the observations, one for each type of flower. In a classification problem, we are trying to distinguish between categories. The simplest case is binary classification. The iris dataset, with three flower categories, is a multi-class classification problem.

There's more...

With the same data, we can rephrase the problem in many ways, or formulate new problems. What if we want to determine relationships between the observations? We can define the petal width as the target variable. We can rephrase the problem as a regression problem and try to predict the target variable as a real number, not just three categories. Fundamentally, it comes down to what we intend to predict. Here, we desire to predict a type of flower.

You have been reading a chapter from
scikit-learn Cookbook - Second Edition
Published in: Nov 2017 Publisher: Packt ISBN-13: 9781787286382
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime