The curse of dimensionality
Most clustering algorithms depend on the distances between points in the data space. But it is a fact of Euclidean geometry that the average distance between points grows as the number of dimensions increases.
For example, look at the unit hypercube:
The one-dimensional hypercube H1 is the unit interval [0, 1]. The two points that are farthest apart in this set are 0 and 1, whose distance is d(0, 1) = 1.
The two-dimensional hypercube H2 is the unit square. The two points that are farthest apart in H2 are the corner points 0 = (0, 0) and x = (1, 1), whose distance is d(0, x) = √(1² + 1²) = √2.
In the n-dimensional hypercube Hn, the two corner points 0 = (0, 0, …, 0) and x = (1, 1, …, 1) are at the distance d(0, x) = √(1² + 1² + ⋯ + 1²) = √n.
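A small sketch makes the growth concrete. The helper names below (`diagonal`, `avg_pairwise_distance`) are illustrative, not from the text; the second function estimates the mean distance between uniformly random points in Hn by Monte Carlo sampling.

```python
import math
import random

def diagonal(n):
    """Distance between the corners (0,...,0) and (1,...,1) of the unit hypercube H_n."""
    return math.sqrt(sum(1 for _ in range(n)))  # sqrt(1^2 + ... + 1^2) = sqrt(n)

def avg_pairwise_distance(n, samples=2000, seed=0):
    """Monte Carlo estimate of the mean distance between two uniform random points in H_n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = [rng.random() for _ in range(n)]
        y = [rng.random() for _ in range(n)]
        total += math.dist(x, y)
    return total / samples

for n in (1, 2, 10, 100):
    print(n, diagonal(n), avg_pairwise_distance(n))
```

Both the diagonal length and the typical distance between random points grow on the order of √n, so in high dimensions "nearest" neighbors are not much nearer than anything else.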
Not only do points tend to be farther apart in higher-dimensional space, but their vectors also tend to be nearly perpendicular. To see that, suppose x = (x1, …, xn) and y = (y1, …, yn) are points in ℝⁿ. Recall that their dot product (also called the scalar product) is x · y = x1y1 + x2y2 + ⋯ + xnyn. But we also have this formula from the Law of Cosines: x · y = ‖x‖ ‖y‖ cos θ, where θ is the angle between the vectors x and y.