For this problem, we will use the k-NN algorithm; k here means that we will look at the k closest neighbors. A white point will be classified as water if the majority of its k closest neighbors lie in an area of water, and as land if the majority of its k closest neighbors lie in an area of land. We will use the Euclidean metric for the distance: given two points, X=[x0,x1] and Y=[y0,y1], their Euclidean distance is defined as dEuclidean = sqrt((x0-y0)^2 + (x1-y1)^2).
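As a minimal sketch, this distance could be computed in Python as follows (the function name euclidean_distance is illustrative and not taken from the chapter's code):

```python
import math

def euclidean_distance(p, q):
    # p and q are 2D points given as [x, y] pairs.
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Example: the distance between [0, 0] and [3, 4] is 5.0.
print(euclidean_distance([0, 0], [3, 4]))
```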
The Euclidean distance is the most common metric. Given two points on a piece of paper, their Euclidean distance is just the length between the two points, as measured by a ruler, as shown in the following diagram:
To apply the k-NN algorithm to an incomplete map, we have to choose a value of k. Since the resulting class of a point is the majority class among its k closest neighbors, k should be odd so that ties cannot occur. Let's apply this algorithm for k=1, 3, 5, 7, and 9.
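A sketch of how one white point could be classified is shown below; the function name, the 'water'/'land' labels, and the data layout (a list of ((x, y), class) pairs) are illustrative assumptions rather than the chapter's actual code:

```python
import math
from collections import Counter

def classify_knn(point, labeled_points, k):
    # labeled_points: list of ((x, y), cls) pairs, where cls is 'water' or 'land'.
    # Sort the labeled points by their Euclidean distance to the query point.
    by_distance = sorted(
        labeled_points,
        key=lambda item: math.dist(point, item[0]),
    )
    # Take the classes of the k closest neighbors and return the majority class.
    k_nearest_classes = [cls for _, cls in by_distance[:k]]
    return Counter(k_nearest_classes).most_common(1)[0][0]

# Example: a point whose nearest neighbors are mostly water is classified as water for k=3.
labeled = [((0, 0), 'water'), ((0, 1), 'water'), ((1, 0), 'land'), ((5, 5), 'land')]
print(classify_knn((0, 0.5), labeled, k=3))  # -> 'water'
```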
Applying this algorithm to every white point on the incomplete map will result in the following complete maps:
k=1
k=3
k=5
k=7
k=9
As you may have noticed, higher values of k result in complete maps with smoother boundaries. An actual complete map of Italy is shown here:
We can use this real, complete map to calculate the percentage of incorrectly classified points for each value of k, and thereby measure the accuracy of the k-NN algorithm for the different values of k (a sketch of this calculation follows the table):
| k | Percentage of incorrectly classified points |
|---|---------------------------------------------|
| 1 | 2.97 |
| 3 | 3.24 |
| 5 | 3.29 |
| 7 | 3.40 |
| 9 | 3.57 |
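These percentages could be computed along the following lines, assuming we have the true class and the predicted class for every originally white point (the variable names here are illustrative assumptions):

```python
def error_percentage(true_classes, predicted_classes):
    # Both arguments are equally long lists of 'water'/'land' labels,
    # one entry per originally white (unclassified) point.
    errors = sum(1 for t, p in zip(true_classes, predicted_classes) if t != p)
    return 100.0 * errors / len(true_classes)

# Example: 1 mismatch out of 4 points gives a 25% error rate.
print(error_percentage(['water', 'land', 'land', 'water'],
                       ['water', 'land', 'water', 'water']))  # -> 25.0
```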
Thus, for this particular classification problem, the k-NN algorithm achieves the highest accuracy (the lowest error rate) for k=1.
In real life, however, we usually do not have the complete data or the solution available. In such scenarios, we need to choose a value of k that is appropriate to the partially available data. For this, consult Problem 4 at the end of this chapter.