For this problem, we will use the k-NN algorithm; k here means that we will look at the k closest neighbors. A white point will be classified as water if the majority of its k closest neighbors lie in an area of water, and as land if the majority of its k closest neighbors lie in an area of land. We will use the Euclidean metric for the distance: given two points, X=[x0,x1] and Y=[y0,y1], their Euclidean distance is defined as dEuclidean = sqrt((x0-y0)^2 + (x1-y1)^2).
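As a minimal sketch, this distance could be computed in Python as follows (the function name euclidean_distance is illustrative and not taken from the chapter's code):

```python
import math

def euclidean_distance(p, q):
    # p and q are 2D points given as [x, y] pairs.
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Example: the distance between [0, 0] and [3, 4] is 5.0.
print(euclidean_distance([0, 0], [3, 4]))
```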
The Euclidean distance is the most common metric. Given two points on a piece of paper, their Euclidean distance is just the length between the two points, as measured by a ruler, as shown in the following diagram:
To apply the k-NN algorithm to an incomplete map, we have to choose a value of k. Since the resulting class of a point is the majority class among its k closest neighbors, k should be odd so that ties cannot occur. Let's apply this algorithm for k=1, 3, 5, 7, and 9.
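A sketch of how one white point could be classified is shown below; the function name, the 'water'/'land' labels, and the data layout (a list of ((x, y), class) pairs) are illustrative assumptions rather than the chapter's actual code:

```python
import math
from collections import Counter

def classify_knn(point, labeled_points, k):
    # labeled_points: list of ((x, y), cls) pairs, where cls is 'water' or 'land'.
    # Sort the labeled points by their Euclidean distance to the query point.
    by_distance = sorted(
        labeled_points,
        key=lambda item: math.dist(point, item[0]),
    )
    # Take the classes of the k closest neighbors and return the majority class.
    k_nearest_classes = [cls for _, cls in by_distance[:k]]
    return Counter(k_nearest_classes).most_common(1)[0][0]

# Example: a point whose nearest neighbors are mostly water is classified as water for k=3.
labeled = [((0, 0), 'water'), ((0, 1), 'water'), ((1, 0), 'land'), ((5, 5), 'land')]
print(classify_knn((0, 0.5), labeled, k=3))  # -> 'water'
```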
Applying this algorithm to every white point on the incomplete map will result in the following complete maps:
k=1
k=3
k=5
k=7
k=9
As you may have noticed, higher values of k result in complete maps with smoother boundaries. An actual complete map of Italy is shown here:
We can use this real, complete map to calculate the percentage of incorrectly classified points for each value of k, and thereby measure the accuracy of the k-NN algorithm for the different values of k (a sketch of this calculation follows the table):
| k | Percentage of incorrectly classified points |
|---|---------------------------------------------|
| 1 | 2.97 |
| 3 | 3.24 |
| 5 | 3.29 |
| 7 | 3.40 |
| 9 | 3.57 |
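These percentages could be computed along the following lines, assuming we have the true class and the predicted class for every originally white point (the variable names here are illustrative assumptions):

```python
def error_percentage(true_classes, predicted_classes):
    # Both arguments are equally long lists of 'water'/'land' labels,
    # one entry per originally white (unclassified) point.
    errors = sum(1 for t, p in zip(true_classes, predicted_classes) if t != p)
    return 100.0 * errors / len(true_classes)

# Example: 1 mismatch out of 4 points gives a 25% error rate.
print(error_percentage(['water', 'land', 'land', 'water'],
                       ['water', 'land', 'water', 'water']))  # -> 25.0
```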
Thus, for this particular classification problem, the k-NN algorithm achieves the highest accuracy (the lowest error rate) for k=1.
In real life, however, we usually do not have the complete data or the solution available. In such scenarios, we need to choose a value of k that is appropriate to the partially available data. For this, consult Problem 4 at the end of this chapter.